We now move to data structures. In Chapter 3 the focus was on individual numbers, booleans, strings and date/times. Data structures hold multiple values. In R you can work with the following data structures:

4.1 Vectors

A vector is a one-dimensional data structure: it has one row and one or more columns. If there is only one column, an R vector holds one number, one character value, one logical or one date/time value. In other words, in the previous chapter, we actually used vectors. A vector, like a matrix or an array, is homogeneous: it allows you to store only one type of value (e.g. numeric, character, …).

4.1.1 Creating a vector: basics

To create a vector we use the c() function to combine the elements within this function in one data structure. Let’s create a vector with numbers:

vec_num <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
vec_num
 [1]  0  1  1  2  3  5  8 13 21 34

You can see the total number of elements in this vector using the length() function:

length(vec_num)
[1] 10

Here, we have a total of 10 columns. You can see in the environment pane that the vector is listed as num [1:10]: a numeric vector with 1 row and 10 columns.

If you check the type of this vector, you’ll see that its type is “double”. In other words, it is a numeric vector.

typeof(vec_num)
[1] "double"

You can check if an object is a vector using is.vector():

is.vector(vec_num)
[1] TRUE

In addition, you can check if a vector is of a given type by including the mode (numeric, logical, …) in is.vector():

is.vector(vec_num, mode = "numeric")
[1] TRUE

The type of vec_num is “double”. You can create a vector with other data types:

vec_char <- c("cat", "mouse", "dog", "bird")
vec_log <- c(TRUE, FALSE, TRUE, TRUE)
vec_int <- c(1L, 10L, 50L)
vec_dat <- c(as.POSIXct("2025-03-25"), as.POSIXct("2025-04-25"))

You can check the type of the data stored in all these vectors:

typeof(vec_char)
[1] "character"
typeof(vec_log)
[1] "logical"
typeof(vec_int)
[1] "integer"
typeof(vec_dat)
[1] "double"

The type of a vector is the type common to all individual elements. In other words, a vector only holds elements of the same type. If this is not the case, R will change the type of all elements in the vector to a type that fits all. This is called implicit coercion: R chooses the type that fits all components of the data structure. For a vector, this means that all values in the columns will have the same type.

For instance, suppose that we have a vector

vec_1 <- c(1, "2", 3)

Here, we mix two numeric values with one character value “2”. If you take a look at this vector, you’ll see that R changes all elements into characters:

vec_1
[1] "1" "2" "3"

You can verify this by checking the type

typeof(vec_1)
[1] "character"

As you can see, vec_1 is not a numeric vector, but a character vector. Let’s take another example:

vec_2 <- c(TRUE, FALSE, 5, as.POSIXct("2025-03-25"))
vec_2
[1]          1          0          5 1742857200

In this example, we have a mix of logical values (TRUE, FALSE), a numeric value and a date/time value. R uses a common type: it sets TRUE equal to 1 and FALSE equal to 0, and shows the date/time value as the number of seconds since January 1, 1970. In other words, R implicitly coerces the vector into a double vector.

typeof(vec_2)
[1] "double"

Let’s see what happens if we mix logical, character and numeric values:

vec_3 <- c(TRUE, FALSE, "a", 5)
vec_3
[1] "TRUE"  "FALSE" "a"     "5"    

Here, from the quotation marks, you can see that R changes the type of all individual elements into character values. These three examples are examples of implicit coercion: R tries to find a way to represent the elements in a vector using a common type. Sometimes this implicit coercion makes sense, sometimes it doesn’t. For instance, combining a numeric value and a character representation of a numeric value creates a character vector. The reason why R changes numbers into characters is that you can always represent a number as a character, while you can not always represent a character as a number. In a similar way, you can represent a logical value as a number, but you can not represent a number as a logical value unless that number happens to be 0 or 1. R will therefore set the type of a vector that includes both logical and numeric values to numeric. The same holds for mixtures of date/time and logical values, of date/time and numeric values, and of date/time, logical and numeric values.
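The ordering behind these examples can be checked directly. The following is a minimal sketch that only uses c() and typeof() from base R: whenever two types are mixed, logical gives way to integer, integer to double, and everything gives way to character.

```r
# Each call mixes two types; typeof() reports the common type R settles on
typeof(c(TRUE, 1L))    # "integer"
typeof(c(1L, 2.5))     # "double"
typeof(c(2.5, "a"))    # "character"
typeof(c(TRUE, "a"))   # "character"
```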

You can also change the type of a vector yourself using one of the as.*() functions: as.numeric(), as.integer(), as.character(), as.logical(), as.Date() or as.POSIXct(). Here, the coercion is explicit. In that case, R will try to change all elements into the requested type. Where this is impossible, R produces NA values. For instance, let’s try to change the three vectors vec_1, vec_2 and vec_3 to numeric:

as.numeric(vec_1)
[1] 1 2 3
as.numeric(vec_2)
[1]          1          0          5 1742857200
as.numeric(vec_3)
Warning: NAs introduced by coercion
[1] NA NA NA  5

For vec_1 and vec_2, R could change all the elements to type numeric: in vec_1, R managed to change the character “2” into the number 2. The same holds for vec_2. Here, R could change TRUE and FALSE into 1 and 0 and show the date/time value in numeric format. For vec_3, changing all elements to numeric was impossible. As a matter of fact, with the exception of the number 5, R didn’t manage to change the type at all. Why couldn’t R change “TRUE” or “FALSE” into 1 and 0 as it could in vec_2? Here, “TRUE” and “FALSE” were character values, not booleans. When vec_3 was created, R changed the type of all its values to “character”. As far as R is concerned, TRUE became “TRUE”, and R doesn’t keep track of the path that led to “TRUE”: it doesn’t recall changing TRUE into “TRUE”. Because of this, R didn’t manage to change “TRUE” back into a boolean TRUE and from there into a number. As this was not possible, it replaced that value with NA.
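If you do want the numbers 1 and 0 back from the character values, you can take the path in two explicit steps. This is a minimal sketch; the helper name step_log is only used for illustration. It first converts the text to logical values and then converts the logical values to numbers.

```r
vec_3 <- c(TRUE, FALSE, "a", 5)   # coerced to character when it is created

# Step 1: text to logical; "a" and "5" have no logical meaning and become NA
step_log <- as.logical(vec_3)
step_log
# [1]  TRUE FALSE    NA    NA

# Step 2: logical to numeric
as.numeric(step_log)
# [1]  1  0 NA NA
```

Note that the value 5 is now lost: the two-step path only recovers the elements that were logical to begin with.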

You can change an object (e.g. a column in a data frame) into a vector using the as.vector() function. This function takes two arguments: the object that you want to convert into a vector and the vector type. For instance

vec_4 <- as.vector(vec_1, mode = "numeric")
vec_4
[1] 1 2 3

creates a numeric vector from vec_1. Note that this operation is not that useful here: vec_1 is already a vector, so as.numeric(vec_1) gives the same result. However, in later chapters we will convert variables or columns in a data frame into vectors. To do so, we will often have to be explicit about the mode. If you leave out the mode, R keeps the type of the object, e.g. the type of the column in a data frame.

So far, all vectors were created using c(), including all elements one by one in this function. Using the vector(mode, length) function, you can create an empty vector of a given type and length. For instance, to create an empty numeric vector of length 10:

vec_1 <- vector("numeric", length = 10)
vec_1
 [1] 0 0 0 0 0 0 0 0 0 0

As you can see, this vector is filled with 0. Note that this is a numeric vector but only for now. If you change one of its elements into a character, the full vector changes from numeric into character. If you want to create an empty character vector:

vec_2 <- vector("character", length = 10)
vec_2
 [1] "" "" "" "" "" "" "" "" "" ""

Here, you can see that each “empty” element is an empty string: a character value with zero characters. Note that this is not a space: recall that a space is a character.

Creating a vector with “0” values can be very useful before a for loop. Suppose that you have a for loop where each iteration adds the result of a calculation to a vector. Here, you have two options. First, you allow the vector to ‘grow’ in every iteration. Second, you define an empty vector with the same length as the number of iterations and you fill each element as you run through the loop. The first option is not very efficient as R will copy the entire vector each time you expand it with one element. This is not the case if you create the vector before the loop. Here, R fills one element after the other but doesn’t need to grow the vector.
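The two options can be put side by side in a small sketch. The names grown and filled and the calculation i^2 are only chosen for illustration. Both loops give the same result, but only the first one has to extend the vector on every pass.

```r
n <- 10

# Option 1: grow the vector inside the loop (R copies it on every pass)
grown <- c()
for (i in seq_len(n)) {
  grown <- c(grown, i^2)
}

# Option 2: create the vector first, then fill it element by element
filled <- vector("numeric", length = n)
for (i in seq_len(n)) {
  filled[i] <- i^2
}

identical(grown, filled)
# [1] TRUE
```

With 10 elements the difference is not noticeable; it starts to matter when the loop runs many thousands of times.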

Recall that NA are missing observations. If a vector includes NA values, that will not change the vector’s type. To see this, let’s create two vectors, one numeric and one character, which both include NA and show their type:

vec_1 <- c(10, 30, NA, 40)
vec_2 <- c("dog", NA, "cat")
typeof(vec_1)
[1] "double"
typeof(vec_2)
[1] "character"

The same holds for NaN (not a number) and Inf (infinity) in a numeric vector. In a character vector, NaN and Inf become the character values “NaN” and “Inf”. In other words, they’ll be considered characters and not special values. NA, in contrast, remains a missing value in a character vector.

Create a numeric vector with 5 columns, 1, 2, 3, 4 and 5. Assign this vector to vec_yt1

Code
vec_yt1 <- c(1, 2, 3, 4, 5)

Check the type of this vector

Code
typeof(vec_yt1)
[1] "double"

Create a new vector, vec_yt2 with values TRUE, FALSE, TRUE, TRUE, FALSE and check the class and type of this vector

Code
vec_yt2 <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
class(vec_yt2)
[1] "logical"
Code
typeof(vec_yt2)
[1] "logical"

Determine the length of the vector vec_yt1.

Code
length(vec_yt1)
[1] 5

Create a character vector vec_yt3 whose elements include: south, west, east, north.

Code
vec_yt3 <- c("south", "west", "east", "north")

Determine the length of this vector and the number of characters

Code
length(vec_yt3)
[1] 4
Code
nchar(vec_yt3)
[1] 5 4 4 5

Can you store the number of characters in a new vector vec_yt3n?

Code
vec_yt3n <- nchar(vec_yt3)

4.1.2 Named vectors

You can define names for the columns of a vector. You can do so when you create the vector using the c() function or the setNames() function or, at a later stage, using the names() function. Suppose that you have a vector with exam results for three courses, A, B and C. A named vector allows you to identify the columns:

vec_1 <- c(A = 15, B = 13, C = 17)

The vector now includes column names. You can see that this is the case in the environment pane where vec_1 is now identified as a Named num [1:3]. These names are also shown if you ask R to show the vector:

vec_1
 A  B  C 
15 13 17 

There are other ways to add names. Using setNames() you can define both the vector as well as the names. Using the previous example:

vec_2 <- setNames(c(15, 13, 17), c("A", "B", "C"))

In a final example, we’ll use the names() function to add names after the vector was created. Let’s first create a vector:

vec_3 <- c(15, 13, 17)

To add names, we include them in another vector and use names() to assign names to vec_3:

names(vec_3) <- c("A", "B", "C")
vec_3
 A  B  C 
15 13 17 

The names function adds an attribute to the vector. To see this, let’s check the attributes of vec_3:

attributes(vec_3)
$names
[1] "A" "B" "C"

You can also use the names() function to extract the names of a vector:

var_names <- names(vec_1)
var_names
[1] "A" "B" "C"

Here, R checks the attributes of the vector vec_1 and copies the names of the variables to var_names. As an alternative, you could have done the same using

attributes(vec_1)$names
[1] "A" "B" "C"

Here, R reads the attributes of vec_1 and extracts the names of the columns.

Extracting the names allows you to store these names in a character vector that you can use in your workflow. With many columns, you can see the names using e.g. str():

str(vec_1)
 Named num [1:3] 15 13 17
 - attr(*, "names")= chr [1:3] "A" "B" "C"

Here, too, you can see that names are defined as an attribute.

To remove the names of columns, you can use unname(obj, force = FALSE). The first argument is the object (e.g. a vector) whose names you want to remove; the second is a specific option to remove names even if the object is a data frame. You can usually keep the default value FALSE.

vec_3 <- unname(vec_3)
vec_3
[1] 15 13 17
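As the names are stored as an attribute, another way to remove them is to set that attribute to NULL. This is a minimal sketch; the vector scores is a hypothetical example and not one of the vectors used above.

```r
scores <- c(A = 15, B = 13, C = 17)   # hypothetical example vector
names(scores) <- NULL                 # drop the names attribute in place
scores
# [1] 15 13 17
attributes(scores)
# NULL
```

Unlike unname(), which returns a copy without names, this changes the vector itself, so no reassignment is needed.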

For the vector vec_yt1 with elements 1, 2, 3: add names A, B and C to this vector. Do this in three ways.

  • Add the names as you create the vector: option 1
Code
vec_yt1 <- c(A = 1, B = 2, C = 3)
vec_yt1
A B C 
1 2 3 
  • Add the names as you create the vector: option 2:
Code
vec_yt1 <- setNames(c(1, 2, 3), c("A", "B", "C"))
vec_yt1
A B C 
1 2 3 
  • Add the names after you have created the vector:
Code
vec_yt1 <- c(1, 2, 3)
vec_yt1
[1] 1 2 3
Code
names(vec_yt1) <- c("A", "B", "C")

# Note that you can use setNames() as well

setNames(vec_yt1, c("A", "B", "C"))
A B C 
1 2 3 
Code
vec_yt1
A B C 
1 2 3 

Check the attributes of vec_yt1:

Code
attributes(vec_yt1)
$names
[1] "A" "B" "C"

Extract the names of vec_yt1 and store them in a vector vec_yt1_names

Code
# Option 1
vec_yt1_names <- names(vec_yt1)

# Option 2
vec_yt1_names <- attributes(vec_yt1)$names

Would the following code work to remove the names from vec_yt1? If not, how can you remove the names?

unname(vec_yt1)
attributes(vec_yt1)

Does it work? If you don’t think so, check:

Code
vec_yt1 <- unname(vec_yt1)
attributes(vec_yt1)
NULL

4.1.3 Creating a vector: replicating elements of another vector

Using rep(x, times, length.out, each) you can replicate the values in a vector x. The first argument holds the values you want to replicate. This can be any value: a number, a character, a vector … . The other arguments determine how many times or how x needs to be replicated. Suppose you want a vector that repeats one value 10 times. To create a vector with 10 columns and all values equal to 25:

vec_rep <- rep(x = 25, times = 10)

Here, x was a number, but you can also replicate characters or other vectors:

vec_rep_char <- rep(x = "ABC", times = 5)
vec_rep_vec <- rep(x = c(1, 2, 3), times = 5)
vec_rep_char
[1] "ABC" "ABC" "ABC" "ABC" "ABC"
vec_rep_vec
 [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3

length.out sets the length of the output vector. If x is a single number, character or date/time value, using length.out or times is equivalent. If x is a vector, this is not the case. In the previous example, c(1, 2, 3) was replicated 5 times. In other words, the length of the output vector was 15. Using length.out you can set the total length. In doing so, R will replicate the vector, but will do so only partially on the last replication. For instance, if you set length.out = 10, the length of the output vector is 10:

vec_rep_vec <- rep(x = c(1, 2, 3), length.out = 10)
vec_rep_vec
 [1] 1 2 3 1 2 3 1 2 3 1

With times and length.out you replicate the full vector on every replication. Using each you replicate every element of the vector each times before moving on to the next one. In other words, the output vector will show the first element of the input vector each times before it changes to the second element of the input vector.

vec_rep_vec <- rep(x = c(1, 2, 3), each = 3)
vec_rep_vec
[1] 1 1 1 2 2 2 3 3 3

Adding length.out sets a limit on the total length of the output vector. Here, it does so by cutting off the output after length.out elements, which reduces the number of replications of the last element of the input vector:

vec_rep_vec <- rep(x = c(1, 2, 3), length.out = 7, each = 3)
vec_rep_vec
[1] 1 1 1 2 2 2 3
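The arguments of rep() can be checked against each other in a short sketch. The first call confirms that times and length.out are equivalent for a single value. The second combines each and times, which is not covered in the text above: R first repeats every element each times and then repeats that result times times.

```r
# For a single value, times and length.out give the same vector
identical(rep("ABC", times = 5), rep("ABC", length.out = 5))
# [1] TRUE

# each is applied first, then the result is repeated as a whole
rep(x = c(1, 2, 3), each = 2, times = 2)
#  [1] 1 1 2 2 3 3 1 1 2 2 3 3
```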

4.1.4 Functions that generate a vector

There are a number of functions that produce a vector. These can be grouped into functions that generate a sequence, functions that generate a vector with random numbers, functions that create a vector by sampling another vector and set operations.

4.1.4.1 Generating a sequence

To create a vector, we used c() and included all its values. Some functions allow you to create a special vector. seq() allows you to fill a vector with a sequence of numbers. To do so, this function requires a start point (from), an end point (to) and either the increment of the sequence (by) or the length of the sequence (length.out). If length.out is specified, then R calculates the increments of the sequence. To see how this works, let’s create a vector which holds a sequence starting at 1, ending at 10 in steps of 1:

vec_1 <- seq(from = 1, to = 10, by = 1)
vec_1
 [1]  1  2  3  4  5  6  7  8  9 10

As an alternative, we can create the same sequence using the length.out argument:

vec_2 <- seq(1, 10, length.out = 10)
vec_2
 [1]  1  2  3  4  5  6  7  8  9 10

What happens if the sequence, starting from the value in from and incrementing with the value in by, doesn’t end exactly at the value given in the to argument? In that case, seq() stops at the last value before the value in to would be exceeded. For instance:

vec_3 <- seq(1, 10, by = 8)
vec_3
[1] 1 9

Using the length.out argument, the sequence always ends in the value in to. That is so because R determines the increment using equally spaced intervals between from and to using length.out. You can use len or length as an alternative for length.out. As an example:

vec_3 <- seq(1, 10, length.out = 25)
vec_3
 [1]  1.000  1.375  1.750  2.125  2.500  2.875  3.250  3.625  4.000  4.375
[11]  4.750  5.125  5.500  5.875  6.250  6.625  7.000  7.375  7.750  8.125
[21]  8.500  8.875  9.250  9.625 10.000
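The increment that R uses here can be worked out by hand. A minimal sketch, with step and by_hand as illustrative names: a sequence of n equally spaced values leaves n - 1 intervals between from and to, so the increment is (to - from) / (n - 1).

```r
from <- 1
to <- 10
n <- 25

# n points leave n - 1 intervals between from and to
step <- (to - from) / (n - 1)
step
# [1] 0.375

# Rebuilding the sequence by hand gives the same values as seq()
by_hand <- from + (0:(n - 1)) * step
all.equal(by_hand, seq(from, to, length.out = n))
# [1] TRUE
```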

You can also use seq() with from, by and length.out. Here, you don’t specify the last value of the sequence. R will generate a sequence of length.out values, starting from the value in from and adding the value in by each time. Note that in this case, you need to name the arguments of the function as you skip the second argument.

vec_4 <- seq(from = 10, by = 10, length.out = 25)
vec_4
 [1]  10  20  30  40  50  60  70  80  90 100 110 120 130 140 150 160 170 180 190
[20] 200 210 220 230 240 250

Note that the increment can be negative. In that case, R will reduce the start value with the value of the increment until it reaches the end value or until it reaches the number of values in length.out:

vec_5 <- seq(from = 100, to = 50, by = -10)
vec_5
[1] 100  90  80  70  60  50

or, as an alternative

vec_6 <- seq(from = 100, by = -10, length.out = 5)
vec_6
[1] 100  90  80  70  60

In specific cases, you can create a sequence using shorter notation. For instance, suppose you want a vector of integers, where each increment is exactly 1. To generate this sequence, you can use

vec_7 <- 21:30
vec_7
 [1] 21 22 23 24 25 26 27 28 29 30

We will often use this short notation for a sequence in a for loop:

for (i in 1:5) {
  print("Hello World")
}
[1] "Hello World"
[1] "Hello World"
[1] "Hello World"
[1] "Hello World"
[1] "Hello World"

Here, i takes each value in 1:5 in turn, i.e. 1, 2, 3, 4 and 5, and R prints Hello World once for each of these values. The counter i starts with a value 1 and increases by 1 after every print of Hello World until it has reached 5.

If the starting position is 1, seq_len() can be used as well:

vec_8 <- seq_len(10)
vec_8
 [1]  1  2  3  4  5  6  7  8  9 10
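One reason to prefer seq_len() over the colon notation in a loop is the case where the length happens to be 0. The sketch below uses an illustrative counter named count: 1:0 counts down from 1 to 0, while seq_len(0) is empty, so a loop over it doesn’t run at all.

```r
n <- 0

1:n          # counts down from 1 to 0
# [1] 1 0
seq_len(n)   # empty sequence
# integer(0)

count <- 0
for (i in seq_len(n)) {
  count <- count + 1   # never reached when n is 0
}
count
# [1] 0
```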

You can use seq.Date() to generate a sequence of dates. The arguments of this function are very similar to those of the seq() function. As a matter of fact, if you use seq() instead of seq.Date(), R recognizes that from is a date and uses seq.Date() without problem. The from argument is the start date, the to argument the end date. If you use to, you also need to specify either the increment (by) or the length of the sequence (length.out). For the increment, you can use “day”, “week”, “month”, “quarter” or “year”. Note that “days”, “weeks”, “months”, “quarters” or “years” are also accepted. If you add an integer, R will increment with a multiple of “day”, … . To illustrate, let’s create three vectors that all start on January 1, 2025 and end on or before December 31, 2025. The first increments in days, the second in steps of 3 weeks and the last in quarters:

start_d <- as.Date("2025-01-01")
end_d <- as.Date("2025-12-31")

vec_d <- seq.Date(from = start_d, to = end_d, by = "day")
vec_w <- seq.Date(from = start_d, to = end_d, by = "3 weeks")
vec_q <- seq.Date(from = start_d, to = end_d, by = "quarter")

R generates a sequence that ends at or before the date in to. To see this, let’s ask for the maximum value in each of these vectors:

max(vec_d)
[1] "2025-12-31"
max(vec_w)
[1] "2025-12-24"
max(vec_q)
[1] "2025-10-01"

If you increment with “day”, the last date is 2025-12-31. However, in both other cases, the last value of the sequence is before 2025-12-31. Using length.out, you determine the length of the sequence, but you allow R to determine the size of the increment if you include a value for to for the end point:

vec_d10 <- seq.Date(from = start_d, to = end_d, length.out = 10)
vec_d10
 [1] "2025-01-01" "2025-02-10" "2025-03-22" "2025-05-02" "2025-06-11"
 [6] "2025-07-22" "2025-08-31" "2025-10-11" "2025-11-20" "2025-12-31"

If you combine a value for both by and length.out, R will determine the end date. For instance, if you use 2025-01-01 as your start date and ask for 10 dates in steps of 1 week, R will produce:

vec_w10 <- seq.Date(from = start_d, by = "weeks", length.out = 10)
vec_w10
 [1] "2025-01-01" "2025-01-08" "2025-01-15" "2025-01-22" "2025-01-29"
 [6] "2025-02-05" "2025-02-12" "2025-02-19" "2025-02-26" "2025-03-05"
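You can verify the size of the steps with diff(), which returns the difference between consecutive elements. A minimal sketch using the same start date: a sequence of 10 dates contains 9 steps, so the last date is 9 weeks after the first.

```r
start_d <- as.Date("2025-01-01")
vec_w10 <- seq.Date(from = start_d, by = "weeks", length.out = 10)

length(vec_w10)             # 10 dates ...
# [1] 10
as.numeric(diff(vec_w10))   # ... separated by 9 steps of 7 days
# [1] 7 7 7 7 7 7 7 7 7
max(vec_w10) == start_d + 9 * 7
# [1] TRUE
```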

As with seq(), you can also use negative increments. In that case, R will count backwards in time. For instance, to generate a sequence starting on 2025-12-31 and ending on or after 2025-01-01 in steps of 5 weeks:

vec_db <- seq.Date(end_d, start_d, by = "-5 weeks")
vec_db
 [1] "2025-12-31" "2025-11-26" "2025-10-22" "2025-09-17" "2025-08-13"
 [6] "2025-07-09" "2025-06-04" "2025-04-30" "2025-03-26" "2025-02-19"
[11] "2025-01-15"

Using seq.POSIXt() you can generate date/time values. As was the case with seq.Date(), you can enter a starting date/time in the from argument and an end date/time in the to argument, and supply the function with an increment: “sec”, “min”, “hour”, “day”, “DSTday”, “week”, “month”, “quarter” or “year”. Adding an “s” will not cause an error. In other words, R knows that “days” is equal to “day”. In addition, you can add an integer to increment in multiples of a unit, e.g. “6 hours”. The difference between “day” and “DSTday” has to do with daylight saving time: “DSTday” takes daylight saving time into account. If you include from, to and length.out, R determines the increment. With from, by and length.out, R generates a sequence of length.out values, starting at from and adding the increment in by each time. The time zone of the sequence is taken from from: as.POSIXct() uses the time zone of your system unless you specify another one in its tz argument. Here are a couple of examples:

start_d <- as.POSIXct("2025-01-01 12:00:00")
end_d <- as.POSIXct("2025-01-05 12:00:00")

vec_dt_hour <- seq.POSIXt(from = start_d, to = end_d, by = "6 hours")
vec_dt_10 <- seq.POSIXt(from = start_d, to = end_d, length.out = 10)
vec_dt_20 <- seq.POSIXt(from = start_d, by = "5 mins", length.out = 20)

One can now look at the examples:

  • increments per 6 hours:
vec_dt_hour
 [1] "2025-01-01 12:00:00 CET" "2025-01-01 18:00:00 CET"
 [3] "2025-01-02 00:00:00 CET" "2025-01-02 06:00:00 CET"
 [5] "2025-01-02 12:00:00 CET" "2025-01-02 18:00:00 CET"
 [7] "2025-01-03 00:00:00 CET" "2025-01-03 06:00:00 CET"
 [9] "2025-01-03 12:00:00 CET" "2025-01-03 18:00:00 CET"
[11] "2025-01-04 00:00:00 CET" "2025-01-04 06:00:00 CET"
[13] "2025-01-04 12:00:00 CET" "2025-01-04 18:00:00 CET"
[15] "2025-01-05 00:00:00 CET" "2025-01-05 06:00:00 CET"
[17] "2025-01-05 12:00:00 CET"
  • a sequence of length 10 between the start and the end:
vec_dt_10
 [1] "2025-01-01 12:00:00 CET" "2025-01-01 22:40:00 CET"
 [3] "2025-01-02 09:20:00 CET" "2025-01-02 20:00:00 CET"
 [5] "2025-01-03 06:40:00 CET" "2025-01-03 17:20:00 CET"
 [7] "2025-01-04 04:00:00 CET" "2025-01-04 14:40:00 CET"
 [9] "2025-01-05 01:20:00 CET" "2025-01-05 12:00:00 CET"
  • a sequence of 20 values, starting from the start, in steps of 5 minutes:
vec_dt_20
 [1] "2025-01-01 12:00:00 CET" "2025-01-01 12:05:00 CET"
 [3] "2025-01-01 12:10:00 CET" "2025-01-01 12:15:00 CET"
 [5] "2025-01-01 12:20:00 CET" "2025-01-01 12:25:00 CET"
 [7] "2025-01-01 12:30:00 CET" "2025-01-01 12:35:00 CET"
 [9] "2025-01-01 12:40:00 CET" "2025-01-01 12:45:00 CET"
[11] "2025-01-01 12:50:00 CET" "2025-01-01 12:55:00 CET"
[13] "2025-01-01 13:00:00 CET" "2025-01-01 13:05:00 CET"
[15] "2025-01-01 13:10:00 CET" "2025-01-01 13:15:00 CET"
[17] "2025-01-01 13:20:00 CET" "2025-01-01 13:25:00 CET"
[19] "2025-01-01 13:30:00 CET" "2025-01-01 13:35:00 CET"

As you can see from these examples, the way to use seq.POSIXt() is very similar to the way you use seq.Date() or seq().
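The output above is shown in CET, presumably the time zone of the system on which the code was run. If the sequence should not depend on the system, you can set the time zone when you create the start and end values. The sketch below assumes UTC; the names start_utc, end_utc and vec_utc are only used for illustration.

```r
# tz = "UTC" is an assumption for this sketch; use the time zone you need
start_utc <- as.POSIXct("2025-01-01 12:00:00", tz = "UTC")
end_utc <- as.POSIXct("2025-01-05 12:00:00", tz = "UTC")

vec_utc <- seq.POSIXt(from = start_utc, to = end_utc, by = "6 hours")
length(vec_utc)   # 4 days in steps of 6 hours: 16 steps, 17 values
# [1] 17
format(range(vec_utc), "%Y-%m-%d %H:%M", tz = "UTC")
# [1] "2025-01-01 12:00" "2025-01-05 12:00"
```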

Generate a vector, vec_yt1, as a sequence

  • starting at 2 and ending at 12 in steps of 2
Code
vec_yt1 <- seq(from = 2, to = 12, by = 2)
  • starting at 10 and ending at 0 in steps of -1
Code
vec_yt1 <- seq(from = 10, to = 0, by = -1)
  • starting at 0, in steps of 5 with a length of 5
Code
vec_yt1 <- seq(from = 0, by = 5, length.out = 5)
  • starting at 0, ending at 14 in steps of 3. Before you unfold the code: what will this vector look like?
Code
vec_yt1 <- seq(from = 0, to = 14, by = 3)
  • using the shortest possible code, write a sequence starting at 5, ending at 50 in steps of 1.
Code
vec_yt1 <- 5:50

Suppose you have a date 2025-03-25 and you need a sequence of 6 dates by week. Write the code to create this sequence and store it in a vector vec_ytd:

Code
vec_ytd <- seq.Date(from = as.Date("2025-03-25"), by = "weeks", length.out = 6)
vec_ytd
[1] "2025-03-25" "2025-04-01" "2025-04-08" "2025-04-15" "2025-04-22"
[6] "2025-04-29"
Code
class(vec_ytd)
[1] "Date"

Generate a vector, vec_yty, that starts at 2000-01-01 and ends at 2024-12-31 in steps of one year. Format the dates so that they only show the year (hint: use ?format) and use the pipe operator in your code.

Code
vec_yty <- seq.Date(from = as.Date("2000-01-01", format = "%Y-%m-%d") , to = as.Date("2024-12-31", format = "%Y-%m-%d"), by = "year") |>
format(format = "%Y")

4.1.4.2 Random numbers

We already covered statistical functions when we discussed numeric data. In that section, we showed how you can use pnorm(), dnorm(), qnorm() and rnorm(). However, we didn’t go into much detail on the latter, rnorm(). The same holds for the other functions that generate random numbers, e.g. rt() for the t-distribution, runif() for the uniform distribution, rf() for the F-distribution or rchisq() for the Chi-square distribution. In simulations, these random number generators are widely used.

Before we turn to these random number generators, a few words about the way software generates these numbers. Random number generators are not truly “random”: they follow an algorithm to generate a sequence of numbers whose properties approximate those of a random sequence. The sequence is fully determined by an initial value that the algorithm starts from. That initial value is called the seed, and this is why these generators are called pseudo random number generators. There are many pseudo random number generators, but the same generator will produce the same sequence of random numbers if the seed is the same. In R, you can select the pseudo random number generator. The default is “Mersenne-Twister”. You can see all other pseudo random number generators that are available if you use ?Random in the console.

Using set.seed(), you can make sure that R generates the same sequence of random numbers every time you ask R to generate a series. This function sets the initial value for the pseudo random number generator. Each time you use this value, you’ll get the same results. This is useful if you want to replicate your results. In addition, if you build a simulation, it is often useful to have the same sequence every time you add components to the simulation’s model.
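The effect of the seed is easy to verify. In this minimal sketch, the seed 1000 and the names draw_1, draw_2 and draw_3 are arbitrary choices: two draws that start from the same seed are identical, while a draw without a reset continues the sequence and will, in practice, differ from the first one.

```r
set.seed(1000)
draw_1 <- rnorm(n = 5)

set.seed(1000)            # reset the generator to the same starting point
draw_2 <- rnorm(n = 5)

identical(draw_1, draw_2)
# [1] TRUE

draw_3 <- rnorm(n = 5)    # no reset: the generator continues its sequence
```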

Every statistical distribution is characterized by its parameters. For the normal distribution, these are the mean and the standard deviation, for Student’s t-distribution as well as the Chi square distribution this parameter is the degrees of freedom, for the uniform distribution you need the minimum and the maximum and for the F-distribution, the ratio of two independent chi square distributed variables, you need two degrees of freedom. If you supply these parameters, you can generate random numbers of these distributions:

set.seed(1000)
v_norm <- rnorm(n = 100, mean = 0, sd = 1)
v_t <- rt(n = 100, df = 5)
v_unif <- runif(n = 100, min = 0, max = 100)
v_chi <- rchisq(n = 100, df = 5)
v_f <- rf(n = 100, df1 = 10, df2 = 2)

With 100 random draws each, we can show the estimated probability density of each of these 5 vectors with randomly generated values using base R’s hist() function:

hist(v_norm, probability = TRUE, col = "lightblue", border = "white", xlab = "Value", main = "Normal")
lines(density(v_norm), lwd = 3, col = "darkgrey")

hist(v_t, probability = TRUE, col = "lightblue", border = "white", xlab = "Value", main = "Student's t")
lines(density(v_t), lwd = 3, col = "darkgrey")

hist(v_unif, probability = TRUE, col = "lightblue", border = "white", xlab = "Value", main = "Uniform")
lines(density(v_unif), lwd = 3, col = "darkgrey")

hist(v_chi, probability = TRUE, col = "lightblue", border = "white", xlab = "Value", main = "Chi squared")
lines(density(v_chi), lwd = 3, col = "darkgrey")

hist(v_f, probability = TRUE, col = "lightblue", border = "white", xlab = "Value", main = "F distribution")
lines(density(v_f), lwd = 3, col = "darkgrey")

4.1.4.3 Sampling

A sample refers to a subset of values from a vector that are drawn at random. sample(x, size, replace = FALSE, prob = NULL) allows you to draw a random sample of a given size from a vector x. By default, sampling is done without replacement. In other words, an element can not appear twice in the sample unless it is included more than once in the vector x. In addition, all elements are equally likely to be drawn (prob = NULL). To illustrate this function, let’s use

vec_1 <- 1:48

and draw a sample, without replacement, of size = 10:

sample(x = vec_1, size = 10)
 [1] 16 43 44 48 35 33 25 47 17 12

If you draw a sample with replacement (replace = TRUE), each draw is returned to the vector and could be drawn again.

sample(x = vec_1, size = 10, replace = TRUE)
 [1] 26 10 47 10 37 42 46 27 23 33

Sampling is not limited to numeric vectors

sample(x = c("a", "b", "c", "d", "e", "f", "g", "h"), size = 10, replace = TRUE)
 [1] "g" "d" "f" "e" "b" "c" "c" "g" "f" "d"

Without replacement, the sample size can not be larger than the length of the vector. With replacement, that is not the case. In the previous example for instance, the length of the vector was 8, while the size was 10. Without replacement, size = 8 would use up all the values of the vector and any size > 8 would not leave sufficient values to sample from. If some values in the vector need a higher probability of being drawn, you need to add a vector with probability weights in the prob argument.
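Both points can be illustrated in a short sketch. The weights 0.7, 0.2 and 0.1, the seed and the names draws and too_many are assumptions chosen for this example. The first call draws with unequal probabilities; the second shows that asking for more values than the vector holds, without replacement, ends in an error, which is caught here with tryCatch().

```r
set.seed(1000)  # only to make the sketch reproducible

# "a" is drawn with probability 0.7, "b" with 0.2 and "c" with 0.1
draws <- sample(x = c("a", "b", "c"), size = 1000, replace = TRUE,
                prob = c(0.7, 0.2, 0.1))
table(draws)    # "a" should be by far the most frequent value

# Without replacement, asking for more values than x holds is an error
too_many <- tryCatch(sample(x = 1:8, size = 10),
                     error = function(e) "error")
too_many
# [1] "error"
```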

As a special case, if you only include the vector x, R returns a random permutation of the vector’s values:

x <- 1:10
sample(x)
 [1]  6  2  9  5  7  1  3 10  8  4

4.1.4.4 Set operations

Using set operations, you determine if an element in one vector is also an element in another, which elements are not shared, or you merge the elements of both in one new vector.

Suppose you have two vectors,

vec_1 <- c(10, 20, 30, 40)
vec_2 <- c(20, 30, 40, 40)

and you want to know if both share common elements. There are various ways to check if that is the case. The first uses the intersect() function. This function has two arguments: the vectors you want to compare. Note that if you load {dplyr}, the package masks this function. To instruct R to use base R’s intersect(), you need to add base:: in front of the function name. The same holds for some other functions in this section.

base::intersect(vec_1, vec_2)
[1] 20 30 40

The output shows the values that these two vectors have in common. If you want to store these values, you assign them to a new vector. Note that this also allows you to see how many values both vectors have in common. Using the length() function, you can verify how many (unique) values are common to both vectors:

length(base::intersect(vec_1, vec_2))
[1] 3

Here, we used numeric values, but you can find common strings in character vectors in a similar way:

friends <- c("Monica", "Phoebe", "Joey", "Chandler", "Ross", "Rachel")
colleagues <- c("Taylor", "David", "Joey", "Sandra")
base::intersect(friends, colleagues)
[1] "Joey"

This example also shows that the vectors don’t have to have the same length. If there are no common values, R will output an empty vector:

base::intersect(c(10, 20), c(50, 60))
numeric(0)

is.element(x, y) allows you to determine if elements of one vector, x, are included in the other, y. The outcome is a boolean vector whose values are TRUE if an element from x occurs in y and FALSE otherwise.

is.element(vec_1, vec_2)
[1] FALSE  TRUE  TRUE  TRUE

The values in the last three columns of vec_1 are also included in vec_2. Using the %in% operator has the same outcome, as it checks which values of the vector on its left hand side are included in the vector on its right hand side:

vec_1 %in% vec_2
[1] FALSE  TRUE  TRUE  TRUE

You can also use this result to see how many elements from the first vector are also in the second. Here, you use the fact that TRUE is also 1 and FALSE is 0:

sum(is.element(vec_1, vec_2))
[1] 3

Note that the order of the vectors matters. If you use is.element(x, y) you check if the elements from x are included in y. With is.element(y, x) you determine the elements in y that are also in x. In the example, you can see that changing the order in the is.element() function shows a different output, as 40 is included twice in vec_2 but only once in vec_1:

vec_1[is.element(vec_1, vec_2)]
[1] 20 30 40
vec_2[is.element(vec_2, vec_1)]
[1] 20 30 40 40

Recall that using “!” you can check if a condition is not met. Here, you can use this to see which elements of x are not in y:

!is.element(vec_1, vec_2)
[1]  TRUE FALSE FALSE FALSE

base::setdiff(x, y) allows you to look for elements that are different, in other words, which elements from x are not included in y. While the output of !is.element(x, y) is a boolean vector, base::setdiff() shows the values of x that are not included in y.

base::setdiff(vec_1, vec_2)
[1] 10

Note again that the order of the vectors matters.
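
To see why the order matters, here is a minimal sketch that swaps the two arguments. It repeats the definitions of vec_1 and vec_2 from above so that it runs on its own; the names d12 and d21 are made up for this illustration.

```r
vec_1 <- c(10, 20, 30, 40)
vec_2 <- c(20, 30, 40, 40)

# Elements of vec_1 that are not in vec_2
d12 <- base::setdiff(vec_1, vec_2)

# Elements of vec_2 that are not in vec_1: every value of vec_2 occurs in vec_1
d21 <- base::setdiff(vec_2, vec_1)
length(d21)
```

The first call returns 10, while the second returns an empty vector.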

To create a union of x and y, there is the base::union(x, y) function. This function shows the unique values after merging the values in x and y:

base::union(vec_1, vec_2)
[1] 10 20 30 40

If you want to know the positions in vec_1 of the elements it shares with vec_2, you can use the which() function:

which(is.element(vec_1, vec_2))
[1] 2 3 4

The unique(x, incomparables = FALSE) function determines the unique values in a vector. Suppose that you have a vector

vec_char <- c("jan", "jan", "feb", "mar", "mar", "apr")

This vector has 4 unique values: “jan”, “feb”, “mar” and “apr”. Using the unique() function, you can select the unique values:

unique(x = vec_char)
[1] "jan" "feb" "mar" "apr"

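If you want to exempt a value from this check, so that none of its occurrences is treated as a duplicate, you can add it to the incomparables = argument. For instance, suppose that you want to see the unique values of all months but keep every occurrence of January; you can add incomparables = c("jan"):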

unique(vec_char, incomparables = c("jan"))
[1] "jan" "jan" "feb" "mar" "apr"

R will now show all occurrences of “jan” as well as the unique values of all others.

Generate a vector, vec_rn with 20 draws from a normal distribution with mean 5 and standard deviation 10

vec_rn <- rnorm(20, mean = 5, sd = 10)

Generate a vector, vec_ru with 20 draws from a uniform distribution with minimum 5 and maximum 10. Write this code without naming the arguments.

vec_ru <- runif(20, 5, 10)

Using vec_rn draw a sample of 6 observations with replacement and assign these to a vector vec_rns

vec_rns <- sample(vec_rn, size = 6, replace = TRUE)

A lottery includes a weekly draw of 6 numbers, without replacement, from a bowl with all numbers from 1 to 40. To play, you buy a ticket with 6 numbers, from 1 to 40. You win something if at least two numbers on your ticket are drawn. Your numbers are 3, 9, 25, 36, 37, 39. Simulate this lottery. To do so, first sample the weekly draw. Second, determine how many of your numbers match the numbers of the draw. Use 3 ways to calculate the number of winning numbers.

draw <- sample(1:40, 6, replace = FALSE)
ticket <- c(3, 9, 25, 36, 37, 39)

# Option 1: use intersect
win <- length(intersect(draw, ticket))
win
[1] 1
# Option 2: use is.element
win <- sum(is.element(ticket, draw))
win
[1] 1
# Option 3: use %in%
win <- sum(ticket %in% draw)
win
[1] 1

4.1.5 Special vectors

R includes a number of special vectors. For instance, the vectors “letters” and “LETTERS” include the letters of the alphabet: the first in lowercase, the second in uppercase.

letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
LETTERS
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"

In addition to letters, the vectors “month.abb” and “month.name” include the names of the months:

month.abb
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
month.name
 [1] "January"   "February"  "March"     "April"     "May"       "June"     
 [7] "July"      "August"    "September" "October"   "November"  "December" 
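
These special vectors are ordinary character vectors, so the functions from the previous chapter work on them. A minimal sketch, assuming only base R (the object names n_letters, same_upper and same_abb are made up for this illustration):

```r
# letters holds the 26 lowercase letters
n_letters <- length(letters)

# Converting letters to uppercase reproduces LETTERS
same_upper <- identical(toupper(letters), LETTERS)

# The first three characters of each month name give the abbreviations
same_abb <- identical(substr(month.name, 1, 3), month.abb)

c(n_letters, same_upper, same_abb)
```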

4.1.6 Subsetting a vector

If you subset a vector, you select one or more columns of that vector (to possibly store them in a new one). We first start with the general case: an unnamed vector. We then continue with the special case of a named vector. Note that all methods for an unnamed vector can be used for named vectors.

4.1.6.1 Subsetting an unnamed vector

Here, we will use the numeric vector vec_num:

vec_num <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)

Note that you can apply most of these ways to subset to other vector types: character, date/time or logical vectors. The approach wouldn’t differ if you used any of these other vector types. There are two subset operators: [] and [[]].

4.1.6.1.1 Subsetting by position

To access an individual element of a vector, you include its position (or index number) between the square brackets of the subscript operator [] after the name of the vector. In R, vector indexing starts at 1. In other words, the first element of a 1x10 vector is at position 1, the second at position 2, … This is not the case in every language. In Python, for instance, the first element of a list is at position 0, the second at position 1, …

Let’s look at the element in the fifth column of vec_num:

vec_num[5]
[1] 3

If you want to extract that element to use it in part of your code, you would assign it to a different vector using the <- operator:

a <- vec_num[5]
a
[1] 3

Note that subsetting leaves the original vector intact. If you subset a vector, you copy the value in a new vector, but that value stays in the original vector.

You can subset more than one column. Suppose that you want to subset columns 1 to 4. To do so, you can use 1:4 within the subscript operator:

vec_num[1:4]
[1] 0 1 1 2

Again, you could assign this new vector. Here, this new vector would have 1 row and 4 columns. These 4 columns would be equal to the first 4 columns of the original vector.

The third way to access elements in a vector using their position is to combine these positions via the c() function within the subsetting operator. The c() function allows you to define the columns you need. The subscript operator will then access these columns and extract their values. Suppose that you want to extract the elements in columns 1 and 4. Note that here, you will extract two columns: 1 and 4. In the previous example you extracted 4 columns: 1 to 4, or columns 1, 2, 3 and 4. To extract columns 1 and 4 you need to include those positions in the c() function, c(1, 4), and use:

vec_num[c(1, 4)]
[1] 0 2

Note that you can mix various ways to subset a vector. For instance, if you need the first to third, the fifth and the seventh to last element, you can combine the various ways of subsetting the vector:

vec_num[c(1:3, 5, 7:10)]
[1]  0  1  1  3  8 13 21 34

4.1.6.1.2 Subsetting using negative positions

You can also use negative numbers for the index elements. In that case, R will show all elements, except those in the negative index (negative index range). For instance,

  • Accessing all elements except for the first:
vec_num[-1]
[1]  1  1  2  3  5  8 13 21 34
  • Accessing all elements except for the first, second, third and fourth (all in that range):
vec_num[-1:-4]
[1]  3  5  8 13 21 34
  • Accessing all elements except for the second and fourth:
vec_num[c(-2, -4)]
[1]  0  1  3  5  8 13 21 34
vec_num[-c(2, 4)]
[1]  0  1  3  5  8 13 21 34

4.1.6.1.3 Subsetting by using a logical vector

The fourth way to subset columns in a vector uses a logical vector of the same length as the vector to subset. To see how this works, let’s first define two vectors: one numeric and one logical:

vec_1 <- c(1, 2, 3, 4, 5)
vec_log <- c(TRUE, FALSE, FALSE, FALSE, TRUE)

You can now subset vec_1 using vec_log:

vec_1[vec_log]
[1] 1 5

If the value on position x in vec_log is “TRUE”, the result of vec_1[vec_log] includes the value in the xth column of vec_1. This is the case for the first and last value. If vec_log’s yth element is “FALSE”, vec_1’s yth element is not extracted.

In the example, we defined the logical vector ourselves. However, there are many other ways to create such a vector. Recall that the outcome of any boolean operation is either TRUE or FALSE. Applying a boolean operation to every column of a vector creates a logical vector of the same length as the vector the operation was applied to. You can now select those columns that meet that condition. For instance, suppose you want to work with the elements of vec_num that are larger than 5. There are two ways to do so. First, you create a logical vector of the same length as vec_num where an element is TRUE if the element in vec_num on the same position meets the condition and FALSE otherwise. To create that vector, you write the original vector followed by the condition and assign the result to a new object. As we will see shortly, boolean operators applied to a vector are applied to every element of that vector. In other words, the logical vector will have the same length as the vector whose elements you want to extract.

cond <- vec_num > 5
cond
 [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

Now we have a logical vector cond whose values are TRUE if the element in the same position in vec_num meets the condition (> 5) and FALSE otherwise. We can now use this vector to subset vec_num:

vec_num[cond]
[1]  8 13 21 34

Here, the TRUE-FALSE elements of cond are used to subset vec_num. If an element in cond is TRUE, vec_num[cond] extracts the element in the same position from vec_num. If the element in cond is FALSE, the element in the same position in vec_num is not extracted.

The second option to use a condition is shorter and uses the condition within the subscript operator:

vec_num[vec_num > 5]
[1]  8 13 21 34

Note that you can use more than one boolean operator. For instance extracting all elements larger than 3 and not equal to 13 can be done using:

cond <- vec_num > 3 & !(vec_num == 13)
vec_num[cond]
[1]  5  8 21 34

or

vec_num[vec_num > 3 & !(vec_num == 13)]
[1]  5  8 21 34

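Note that you can use these conditions also in the case of character vectors. For instance, to extract “cat” and “dog” from vec_char (as last defined above, vec_char holds the month abbreviations, so nothing matches and R returns an empty character vector):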

vec_char[vec_char == "dog" | vec_char == "cat"]
character(0)

If you don’t know the exact location and you don’t have an explicit condition that you can use, but you know which values you want to extract, you can use the %in% operator. Here, you first define a vector with values, e.g. 1, 8 and 143 using c(1, 8, 143). Using the %in% operator, you can now subset the vector vec_num:

vec_num %in% c(1, 8, 143)
 [1] FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

This code checks, for every element of vec_num, whether it is also in c(1, 8, 143). The result of this operation is a logical vector with the same length as vec_num whose elements equal TRUE if the element in vec_num is included in c(1, 8, 143) and FALSE otherwise. This vector now allows you to extract these elements using, e.g.

vec_num[vec_num %in% c(1, 8, 143)]
[1] 1 1 8

As an alternative, you can do so in two steps:

cond <- vec_num %in% c(1, 8, 143)
vec_num[cond]
[1] 1 1 8

Here we included the elements to extract in a c() function. However, you can use any other vector.

a <- c(1, 8, 143)
vec_num[vec_num %in% a]
[1] 1 1 8

Note that you can include values in %in% c() which are not part of the vector. If this is the case, R won’t find them. Hence, adding them wouldn’t change the outcome. If no element in the original vector matches the condition, R will output numeric(0) or a vector of length 0.

To illustrate this and the use of %in% for a character vector, suppose you want to extract “dog”, “fish” from vec_char:

vec_char <- c("dog", "cat", "rabbit")
vec_char[vec_char %in% c("dog", "fish")]
[1] "dog"

As “fish” is not included in vec_char but “dog” is, and “rabbit” is part of vec_char but isn’t in c("dog", "fish"), R outputs “dog”.

Using is.element(), you can subset a vector (the first) and extract all values that are in both the first and the second vector. To illustrate:

vec_1 <- c(1:10)
vec_2 <- c(7:15)
vec_1[is.element(vec_1, vec_2)]
[1]  7  8  9 10

Boolean operators allow you to define many conditions. For instance, if you have a vector that includes missing values, you can extract all non-missing values using !is.na(), i.e. is not (!) NA:

vec_1 <- c(1, 20, 20, NA, NA, 50)
vec_1[!is.na(vec_1)]
[1]  1 20 20 50
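
Missing values also matter when you subset with other conditions. As a short sketch (vec_na and the other object names are made up for this illustration), a comparison with NA returns NA, and an NA in the logical vector produces an NA in the result:

```r
vec_na <- c(1, 20, 20, NA, NA, 50)

# The comparison is NA wherever the value is missing
cond <- vec_na > 10

# Subsetting with NA in the logical vector returns NA elements
with_na <- vec_na[cond]

# Combining the condition with !is.na() drops them
without_na <- vec_na[!is.na(vec_na) & vec_na > 10]
without_na
```

Here with_na still contains the two NA values, while without_na only holds 20, 20 and 50.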

As a second example, suppose that you want to extract all even numbers. Recall that a number is even if the remainder (modulus) after division by 2 is zero:

10 %% 2
[1] 0
11 %% 2
[1] 1

You can use this to create a condition

10 %% 2 == 0
[1] TRUE
11 %% 2 == 0
[1] FALSE

that you can use to subset elements of a vector:

cond <- vec_num %% 2 == 0
vec_num[cond]
[1]  0  2  8 34

As a third example, recall that you can use grepl() or stringr::str_detect() to check if a pattern occurs in a string. If that is the case, these functions output TRUE. Suppose you have a character vector

vec_char <- c("sales_shoes", "sales_trousers", "sales_shirts", "sales_jackets")

and you want to extract the column which includes “shoes”. Using grepl() you can identify the elements in the character vector that include the word “shoes”:

grepl(pattern = "shoes", x = vec_char)
[1]  TRUE FALSE FALSE FALSE

You can use this function to extract the element “sales_shoes” from vec_char.

vec_char[grepl(pattern = "shoes", x = vec_char)]
[1] "sales_shoes"

4.1.6.1.4 Index positions

In all examples we either used an exact index position or a logical vector to extract the values of a vector. What if you are not interested in a value but in an index position? To show an index position rather than its value or TRUE or FALSE, you can use the which() function. For instance, suppose you want to know the position of value 1 in vec_num. To find this position, you can use

which(vec_num == 1)
[1] 2 3

The result shows the index positions where you can find 1 in vec_num. Note that you can save the output in a new vector with positions. You can now subset that vector to find the first occurrence. As an alternative, as which() outputs a vector, you can find the first occurrence subsetting the which() function. For instance, to find the first 1 in vec_num:

which(vec_num == 1)[1]
[1] 2

What if you want to find multiple values? Here, you can use the %in% operator. Suppose you want to know the positions of the values 1, 2, 8 and 55. First you collect these values in a vector using c(1, 2, 8, 55). You can now use that vector in the which() function:

which(vec_num %in% c(1, 2, 8, 55))
[1] 2 3 4 7

which() shows every occurrence. Using match() you can find the first occurrence. For instance, the first occurrence of “1” in vec_num is in position

match(1, vec_num)
[1] 2
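
match() also accepts several values at once. A minimal sketch, repeating the definition of vec_num so that it runs on its own (pos is a made-up name): for every value it returns the position of the first occurrence, and NA if the value does not occur.

```r
vec_num <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)

# First position of 1 and of 8; 55 is not in vec_num
pos <- match(c(1, 8, 55), vec_num)
pos
```

The result is 2, 7 and NA. Unlike which(), the output has one entry per value you look for, in the order in which you listed them.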

Using which() allows you to extract the positions of e.g. missing values. Suppose you have a vector vec_1 which includes missing values:

vec_1 <- c(10, 10, 20, NA, 30, 40, NA, 50, 50)

To locate these missing values, you can use

which(is.na(vec_1))
[1] 4 7

There are two variants of the which() function that allow you to find the location of the (first) maximum or minimum value: which.max() and which.min():

which.max(vec_1)
[1] 8
which.min(vec_1)
[1] 1

Using logical values, you can find the first occurrence of a value that meets a condition. Here, which.max() uses the fact that TRUE = 1 and FALSE = 0. In other words, this function will show the position of the first TRUE:

which.max(vec_1 > 30)
[1] 6
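
One caveat with this trick: if no element meets the condition, all values are FALSE (or NA) and the first maximum is simply the first element. The sketch below illustrates this with a made-up vector vec_na and a condition that is never met; which(cond)[1] is shown as a possible alternative that returns NA in that case.

```r
vec_na <- c(10, 10, 20, NA, 30, 40, NA, 50, 50)

cond <- vec_na > 100          # no element meets this condition
first_max <- which.max(cond)  # position 1, although cond[1] is FALSE
first_true <- which(cond)[1]  # NA: there is no TRUE at all
c(first_max, first_true)
```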

4.1.6.2 Subsetting a named vector

With a named vector, you can also use the column names to subset. Suppose that you have a vector

vec_1 <- c(A = 10, B = 30, C = 50, D = 70)

First you can use the ways you would use to subset an unnamed vector, e.g.

vec_1[3]
 C 
50 
vec_1[2:4]
 B  C  D 
30 50 70 
vec_1[vec_1 < 50]
 A  B 
10 30 

As you can see using [] preserves the structure of the vector: the output shows both the column name as well as its value.

You can also use the name of the column to subset the vector using vec_1["column name"]:

vec_1["A"]
 A 
10 

The output shows both the column name as well as the value. In other words, here too, the structure of the vector is preserved.

To subset more than one column, you can use

vec_1[c("A", "D")]
 A  D 
10 70 

To extract only the value, you have to refer to the column using the subsetting operator [[]]. You can do so using either the column name or the column number. These lines extract the value of the second column:

vec_1[["B"]]
[1] 30
vec_1[[2]]
[1] 30

The output shows the value without the column name. The [[]] operator simplifies the structure of the vector: it returns the simplest possible data structure. Here this is the value of the column, i.e. an unnamed vector.
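
A minimal sketch of this difference, using a copy of the vector under the made-up name vec_named. names() shows whether the name survives; unname() is a base R function that drops the names when you extract more than one column, where [[]] cannot be used.

```r
vec_named <- c(A = 10, B = 30, C = 50, D = 70)

# [] keeps the name, [[]] drops it
kept <- names(vec_named["B"])
dropped <- names(vec_named[["B"]])

# [[]] extracts exactly one element; for several, unname() removes the names
several <- unname(vec_named[c("B", "C")])
several
```

Here kept is "B", dropped is NULL and several is the unnamed vector 30 50.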

You can also subset columns whose name includes a pattern. Recall that names() allows you to extract the names of the columns in a named vector. Using grepl() you can check if these names include a pattern. For instance, let’s check if the names of vec_1 include “A”. Using grepl():

grepl(pattern = "A", x = names(vec_1))
[1]  TRUE FALSE FALSE FALSE

To extract that column, you include that statement in vec_1[]:

vec_1[grepl(pattern = "A", x = names(vec_1))]
 A 
10 

The result shows the name of the column and its value.
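
As an alternative sketch, base R’s grep() returns the index positions of the matching names rather than a logical vector, and you can subset by position with those. The names vec_named, pos and sub_a are made up for this illustration.

```r
vec_named <- c(A = 10, B = 30, C = 50, D = 70)

# grep() returns the positions of the names that include the pattern
pos <- grep(pattern = "A", x = names(vec_named))

# Subsetting by position keeps the name, as [] always does
sub_a <- vec_named[pos]
sub_a
```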

Generate a vector:

vec <- c(21:30)

Extract the following elements from this vector:

  • the 5th element
vec[5]
[1] 25
  • all elements from 1 to 5
vec[1:5]
[1] 21 22 23 24 25
  • elements in columns 1, 3 and 9
vec[c(1, 3, 9)]
[1] 21 23 29
  • all elements except columns 1, 3 and 9
vec[-c(1, 3, 9)]
[1] 22 24 25 26 27 28 30
  • all elements larger than 25
vec[vec > 25]
[1] 26 27 28 29 30

Use this vector

vecchar <- c("dog", "fish", "cat", "bird", "duck", "rabbit")

to extract all animals whose name includes an “a”

vecchar[grepl(pattern = "a", vecchar)]
[1] "cat"    "rabbit"

4.1.7 Adding, removing and changing elements to a vector

4.1.7.1 Adding elements to a vector

As in the previous section, I’ll use a numeric vector here, but you can apply the rules also to other types of vectors. Suppose that you have the 1x10 vector vec_num and you want to add a column with the value 55. The first way to do so is to use the c() function to create a new vector

c(vec_num, 55)
 [1]  0  1  1  2  3  5  8 13 21 34 55

In this way you can add multiple columns and/or multiple vectors:

c(vec_num, c(55, 89, 144), c(233, 377, 610))
 [1]   0   1   1   2   3   5   8  13  21  34  55  89 144 233 377 610

c() adds all elements in the order in which they appear in the function:

c(c(610, 377, 233), c(144, 89, 55), vec_num)
 [1] 610 377 233 144  89  55   0   1   1   2   3   5   8  13  21  34

Note that this doesn’t change vec_num: c() creates a new vector. If you want to change vec_num you have to reassign it to the new vector. As an alternative, you can assign the new vector to a new object:

vec_1 <- c(vec_num, c(55, 89, 144), c(233, 377, 610))
vec_1
 [1]   0   1   1   2   3   5   8  13  21  34  55  89 144 233 377 610

If you have a named vector, you can add a new named vector:

vec_1 <- c(A = 10, B = 30, C = 50, D = 70)
c(vec_1, c(E = 90))
 A  B  C  D  E 
10 30 50 70 90 

You can also use the append() function to add new elements. By default, append() adds the new elements after the last element of the existing vector. In other words, by default, append() is similar to c(). However, the after = argument in append(x, values, after = length(x)) allows you to change that default position. If you want to add the new element after position 3, you change the default length(x) to 3. Note that append() doesn’t change the original vector:

append(vec_num, 55)
 [1]  0  1  1  2  3  5  8 13 21 34 55
vec_num
 [1]  0  1  1  2  3  5  8 13 21 34

If you want to change the original vector, you have to reassign it to its new values or assign the outcome to a new object:

vec_1 <- append(vec_num, 144)
vec_1
 [1]   0   1   1   2   3   5   8  13  21  34 144

To add the value 88 as the first element or 143 after column 9, you can change the default location in append()’s after = argument:

append(vec_num, 88, after = 0)
 [1] 88  0  1  1  2  3  5  8 13 21 34
append(vec_num, 143, after = 9)
 [1]   0   1   1   2   3   5   8  13  21 143  34

Using the c() function, you can add multiple elements. For instance, if you want to add 88 and 143 as the first two columns of vec_num you combine these two values within c() and include them in the append statement:

append(vec_num, c(88, 143), after = 0)
 [1]  88 143   0   1   1   2   3   5   8  13  21  34

Note that you can change the position where these new values are added. However, all elements are added after the same position, and their order follows their order within the c() function. Note also that, if you add an element whose type is different from the vector type, R will change the vector type.
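
A minimal sketch of this type change, using a short made-up vector vec_small: appending a string turns the result into a character vector, while appending a logical value leaves it numeric because TRUE is coerced to 1.

```r
vec_small <- c(0, 1, 1, 2, 3)

# Appending a string coerces every element to character
type_chr <- typeof(append(vec_small, "5"))

# Appending a logical keeps the vector numeric: TRUE becomes 1
type_dbl <- typeof(append(vec_small, TRUE))

c(type_chr, type_dbl)
```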

You can also add a named vector

append(vec_1, c(E = 50), after = 0)
  E                                             
 50   0   1   1   2   3   5   8  13  21  34 144 

4.1.7.2 Removing elements from a vector

There are multiple ways to remove elements from a vector. We already covered two. First, if you know the position of the elements you want to remove, you can use a negative index. Recall that a negative index allows you to extract the elements of a vector except those included in the negative index. For instance, if you want to remove the first 4 columns of vec_num you can do this using

vec_num[-1:-4]
[1]  3  5  8 13 21 34
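
A note on the notation: -1:-4 creates the sequence -1, -2, -3, -4, which is the same as -(1:4). Leaving out the second minus sign gives -1:4, a sequence that runs from -1 to 4 and mixes negative and positive positions, which R does not accept. The sketch below shows this; it repeats the definition of vec_num so that it runs on its own, and the names a, b and mixed are made up for this illustration.

```r
vec_num <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)

# Two equivalent ways to drop the first four columns
a <- vec_num[-1:-4]
b <- vec_num[-(1:4)]
identical(a, b)

# -1:4 mixes negative and positive positions; tryCatch() captures the error
mixed <- tryCatch(vec_num[-1:4], error = function(e) "error")
mixed
```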

To remove the first and the last column (but none of the columns in between):

vec_num[-c(1, length(vec_num))]
[1]  1  1  2  3  5  8 13 21

or

vec_num[c(-1, -length(vec_num))]
[1]  1  1  2  3  5  8 13 21

You can use this approach if you know the exact location (i.e. the columns) you want to remove.

The second way to remove elements uses a condition. For instance, the code to remove all elements that are larger than 3 or equal to 0 (in other words, to keep the elements that are not larger than 3 and not equal to 0) is

vec_num[!(vec_num > 3) & !(vec_num == 0)]
[1] 1 1 2 3

or, using a specific vector including the condition:

cond <- !(vec_num > 3) & !(vec_num == 0)
vec_num[cond]
[1] 1 1 2 3

You can use this approach if you know the condition that elements need to meet.

If you want to remove known values from a vector, e.g. 1, 8 and 143, you can use an approach which is very similar to the one you used to subset these elements. First, you collect them in a vector c(1, 8, 143). Second, you use %in% and not (!) to remove these elements:

cond <- vec_num %in% c(1, 8, 143)
vec_num[!cond]
[1]  0  2  3  5 13 21 34

or, in one line of code

vec_num[!vec_num %in% c(1, 8, 143)]
[1]  0  2  3  5 13 21 34

In the last statement, 143 was included in the vector with values to remove but is not in vec_num. R doesn’t check if all values to be removed are also in the vector where they need to be removed.

4.1.7.3 Changing elements in a vector

Suppose that you know which column you want to change in your vector, e.g. you want to change the value in the 4th column. To do this, you first subset that element using vec_num[4] and then reassign its value. For instance, changing the fourth element to 250:

vec_num[4] <- 250
vec_num
 [1]   0   1   1 250   3   5   8  13  21  34

As you can see, the fourth element is now 250. Note that the new value needs to be of the same type as the vector. If that is not the case, you’ll change the type of all other elements in the vector. For instance

vec_num[4] <- "250"

changes the type of the vector from double to character:

typeof(vec_num)
[1] "character"

In that case, you have to change the vector’s type:

vec_num <- as.numeric(vec_num)

Using replace() you can change many values in a vector. Suppose you want to change the values in columns 1, 8 and 10 to 50, 100 and 150. The first argument in the replace() function is the vector you want to change. Here, this is vec_num. The second argument is a vector with index positions. Using c(1, 8, 10) you set these positions. The last argument is a vector with the values that will replace the values in the index positions. Here you would use c(50, 100, 150). Using these in the replace() function:

replace(vec_num, c(1, 8, 10), c(50, 100, 150))
 [1]  50   1   1 250   3   5   8 100  21 150

Note that the length of the index vector and the length of the vector with new values should be equal. If the vector with new values is longer, R shows a warning and ignores the surplus values:

replace(vec_num, c(1, 8, 10), c(50, 100, 150, 200))
Warning in x[list] <- values: number of items to replace is not a multiple of
replacement length
 [1]  50   1   1 250   3   5   8 100  21 150

If you want to replace all values that meet a certain condition with one single value, you can use the replace() function as well. Suppose you want to replace all values larger than 25 with 50. Using replace() you could do this with:

replace(vec_num, vec_num > 25, 50)
 [1]  0  1  1 50  3  5  8 13 21 50
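
Like c() and append(), replace() returns a new vector and leaves the original untouched. The sketch below, with made-up object names vec_a, vec_b and vec_c, compares it with assigning to a logical subset, which changes the vector it is applied to.

```r
vec_a <- c(0, 1, 1, 250, 3, 5, 8, 13, 21, 34)

# replace() returns a modified copy; vec_a itself is unchanged
vec_b <- replace(vec_a, vec_a > 25, 50)

# Assigning to a logical subset changes the vector in place (here a copy)
vec_c <- vec_a
vec_c[vec_c > 25] <- 50

identical(vec_b, vec_c)
```

Both routes give the same values; which one you use depends on whether you want to keep the original vector.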

Changing the vector’s type is another way to change a vector. Suppose you have a vector

vec_dat_char <- c("01-01-2025", "02-01-2025", "03-01-2025")

This vector is a character vector:

typeof(vec_dat_char)
[1] "character"

You can change this type to Date or POSIX using as.Date() or as.POSIXct(). Using the first:

as.Date(vec_dat_char, format = "%d-%m-%Y")
[1] "2025-01-01" "2025-01-02" "2025-01-03"

In a similar way, you can change the type of numeric variables to character, dates to numeric, … .
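
A few such conversions, as a minimal sketch with made-up values and object names. Converting a Date to numeric gives the number of days since 1970-01-01.

```r
# Numeric to character
num_chr <- as.character(c(1, 2.5))

# Character to numeric
chr_num <- as.numeric("3.14")

# Character to Date, then Date to numeric (days since 1970-01-01)
d <- as.Date("02-01-2025", format = "%d-%m-%Y")
d_num <- as.numeric(d)

typeof(num_chr)
```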

4.1.7.4 Sorting vectors

To sort a vector, R includes the sort(x, decreasing = FALSE, na.last = NA) function. Here, x is the vector to sort. By default, R sorts in increasing order. The last argument determines the treatment of “NA” values. By default (na.last = NA), they are removed. With na.last = TRUE, missing values are retained and placed last. With na.last = FALSE, they are placed first.

sort(x = vec_num, decreasing = FALSE)
 [1]   0   1   1   3   5   8  13  21  34 250
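
As vec_num has no missing values, here is a short sketch of the three na.last = settings with a made-up vector vec_na:

```r
vec_na <- c(30, NA, 10, 20)

s_default <- sort(vec_na)                   # NA removed
s_last <- sort(vec_na, na.last = TRUE)      # NA kept, placed last
s_first <- sort(vec_na, na.last = FALSE)    # NA kept, placed first
s_last
```

Note that the default silently shortens the vector: s_default has three elements, the other two have four.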

Character vectors are sorted alphabetically by default. The exact order depends on the locale of your R session:

sort(x = c("zoo", "Zoo", "coast", "coAst", "cOAst", "lake"))
[1] "coast" "coAst" "cOAst" "lake"  "zoo"   "Zoo"  

As you can see, if the strings differ only in their use of uppercase letters, R (in the locale used here) places the versions with the fewest uppercase letters first.

Generate a vector:

vec <- c(21:30)

Change this vector:

  • add c(31, 32, 33, 34, 34) after the last position in vec. Use two methods to do so. Store the results in vec_r:
# Option 1: use c()
vec_r <- c(vec, c(31, 32, 33, 34, 34))
vec_r
 [1] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 34
# Option 2: use append()
vec_r <- append(vec, c(31, 32, 33, 34, 34))
vec_r
 [1] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 34
  • add c(31, 32, 33, 34, 34) as the first elements of vec. Use two methods to do so. Store the results in vec_r:
# Option 1: use c()
vec_r <- c(c(31, 32, 33, 34, 34), vec)
vec_r
 [1] 31 32 33 34 34 21 22 23 24 25 26 27 28 29 30
# Option 2: use append() and add position
vec_r <- append(vec, c(31, 32, 33, 34, 34), after = 0)
vec_r
 [1] 31 32 33 34 34 21 22 23 24 25 26 27 28 29 30
  • add c(31, 32, 33, 34, 34) after the fifth element of vec. Store the results in vec_r:
vec_r <- append(vec, c(31, 32, 33, 34, 34), after = 5)
vec_r
 [1] 21 22 23 24 25 31 32 33 34 34 26 27 28 29 30

Using vec_r you created in the last exercise:

  • remove the columns 6 to 10
vec_r <- vec_r[-6:-10] 
vec_r
 [1] 21 22 23 24 25 26 27 28 29 30
  • change the fifth column to 250
vec_r[5] <- 250
  • replace the values on position 1, 2 and 3 with 210, 220 and 230, store the result in vec_r.
vec_r <- replace(vec_r, c(1, 2, 3), c(210, 220, 230))
vec_r
 [1] 210 220 230  24 250  26  27  28  29  30
  • replace all elements smaller than 100 with 100
replace(vec_r, vec_r < 100, 100)
 [1] 210 220 230 100 250 100 100 100 100 100

Using vec, sort this vector in decreasing and increasing order.

sort(vec, decreasing = TRUE)
 [1] 30 29 28 27 26 25 24 23 22 21
sort(vec)
 [1] 21 22 23 24 25 26 27 28 29 30

4.1.8 Functions and vectors

Many operations in R are vectorized. This means that an operator works on a vector’s individual elements. For functions, that means that R, for most of them, applies them to every element of that vector.

4.1.8.1 Numeric vectors

We introduced mathematical operators and functions, statistical functions and e.g. rounding in the previous chapter. Almost all of these are vectorized. All operators and functions generate output. If you want to store these results you have to assign them to a new object. Here this object is usually a vector. In the examples this assignment is left out to keep the code short.

4.1.8.1.1 Mathematical operators and functions

Let’s first create two vectors, vec_num1 and vec_num2:

vec_num1 <- c(10, 10, 20, 30, 50, 80, 130, 210, 340, 550)
vec_num2 <- c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55)

If you add a numeric value to a vector, subtract it from a vector, or multiply or divide a vector by a numeric value, R applies this operation to every element of the vector. For instance

  • addition:
vec_num1 + 100
 [1] 110 110 120 130 150 180 230 310 440 650
  • subtraction:
vec_num1 - 100
 [1] -90 -90 -80 -70 -50 -20  30 110 240 450
  • multiplication:
vec_num1 * 10
 [1]  100  100  200  300  500  800 1300 2100 3400 5500
  • division:
vec_num1 / 25
 [1]  0.4  0.4  0.8  1.2  2.0  3.2  5.2  8.4 13.6 22.0
  • integer division:
vec_num1 %/% 3
 [1]   3   3   6  10  16  26  43  70 113 183
  • modulus:
vec_num1 %% 3
 [1] 1 1 2 0 2 2 1 0 1 1

Applied to two vectors of the same length, R adds, subtracts, multiplies or divides each element in one vector to/from/with/by the corresponding element in the other vector:

  • addition:
vec_num1 + vec_num2
 [1]  11  11  22  33  55  88 143 231 374 605
  • subtraction:
vec_num1 - vec_num2
 [1]   9   9  18  27  45  72 117 189 306 495
  • multiplication:
vec_num1 * vec_num2
 [1]    10    10    40    90   250   640  1690  4410 11560 30250
  • division:
vec_num1 / vec_num2
 [1] 10 10 10 10 10 10 10 10 10 10
  • integer division:
vec_num1 %/% vec_num2
 [1] 10 10 10 10 10 10 10 10 10 10
  • modulus:
vec_num1 %% vec_num2
 [1] 0 0 0 0 0 0 0 0 0 0

Note that this saves a lot of work. Without vectorization, to add two vectors, you would have to write some code, e.g.:

# Element-wise addition written out by hand
if (length(vec_num1) != length(vec_num2)) {
  print("Cannot add vectors of a different length")
} else {
  # Pre-allocate the result vector, then fill it one position at a time
  vec_num4 <- vector("numeric", length = length(vec_num1))
  for (i in seq_along(vec_num1)) {
    vec_num4[i] <- vec_num1[i] + vec_num2[i]
  }
}
vec_num4
 [1]  11  11  22  33  55  88 143 231 374 605
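
The examples above use two vectors of the same length. When the lengths differ, R does not stop: it recycles the shorter vector. The sketch below, with made-up vectors and object names, shows this behaviour; when the longer length is not a multiple of the shorter one, R still returns a result but adds a warning.

```r
vec_long <- c(1, 2, 3, 4)
vec_short <- c(10, 20)

# vec_short is recycled: 1 + 10, 2 + 20, 3 + 10, 4 + 20
recycled <- vec_long + vec_short

# Length 3 is not a multiple of length 2: tryCatch() captures the warning
partial <- tryCatch(c(1, 2, 3) + vec_short, warning = function(w) "warning")
recycled
```

Adding a single number to a vector, as in vec_num1 + 100, is the simplest case of this rule: the number is a vector of length 1 that is recycled for every element.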

For functions, let’s illustrate vectorisation using vec_num1. All functions were introduced in previous sections.

  • absolute value:
abs(-vec_num1)
 [1]  10  10  20  30  50  80 130 210 340 550
  • logarithm base e (natural logarithm):
log(vec_num1)
 [1] 2.302585 2.302585 2.995732 3.401197 3.912023 4.382027 4.867534 5.347108
 [9] 5.828946 6.309918
  • logarithm base 10:
log10(vec_num1) 
 [1] 1.000000 1.000000 1.301030 1.477121 1.698970 1.903090 2.113943 2.322219
 [9] 2.531479 2.740363
log(vec_num1, base = 10)
 [1] 1.000000 1.000000 1.301030 1.477121 1.698970 1.903090 2.113943 2.322219
 [9] 2.531479 2.740363
  • square root:
sqrt(vec_num1)
 [1]  3.162278  3.162278  4.472136  5.477226  7.071068  8.944272 11.401754
 [8] 14.491377 18.439089 23.452079
  • power, e.g. 2:
vec_num1^2
 [1]    100    100    400    900   2500   6400  16900  44100 115600 302500
  • exponent (e to the power n, where n is an element of the vector):
exp(vec_num1)
 [1]  2.202647e+04  2.202647e+04  4.851652e+08  1.068647e+13  5.184706e+21
 [6]  5.540622e+34  2.872650e+56  1.591627e+91 4.572186e+147 7.277212e+238
4.1.8.1.2 Other useful vector functions

R has many useful vector functions; I’ll introduce only a couple of them here. To illustrate what they do, we’ll use

vec_num1 <- c(1, 2, 3, 4, 3, 2, 1)

cumsum(x) returns the cumulative sum of a vector. Its first element is the first element of x; its second element is the sum of the first two elements of x; its third element equals its second element plus the third element of x, and so on. If one of the elements is a missing value (NA), that element and all following elements of the result are NA.

cumsum(vec_num1)
[1]  1  3  6 10 13 15 16

As you can see, the second element is equal to 1 + 2, the sum of the first two elements in vec_num1. The third element, 6, is equal to the second element in the cumulative sum (3) plus the third element in vec_num1 (3), and so on.
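
To illustrate how a missing value affects the cumulative sum, here is a minimal sketch. The vector x_na is made up for this example, and replacing NA with 0 is only one possible workaround, which may or may not be appropriate for your data.

```r
x_na <- c(1, 2, NA, 4)

# From the first NA onwards, every cumulative sum is NA
cs <- cumsum(x_na)
cs

# One possible workaround: treat missing values as 0 before summing
cs_zero <- cumsum(ifelse(is.na(x_na), 0, x_na))
cs_zero
```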

cumprod(x) is a similar function but calculates the cumulative product.

cumprod(vec_num1)
[1]   1   2   6  24  72 144 144

cummax(x) and cummin(x) produce a vector with the cumulative maximum and minimum values. Both use the first element of x as their first element. For cummax(), if the second element in x is larger than the first, the second element in the output vector equals that value; else it equals the first value. The function then evaluates the third element in x. If that element is larger than the second element in the output vector, the third element in the output is that third element of x; else it equals the second element of the output. To see how this works:

cummax(vec_num1)
[1] 1 2 3 4 4 4 4

As you can see, the first element is 1. As the second element in vec_num1 is 2, this is a new maximum and the cummax() vector’s second element is 2. The same holds for the third and fourth elements in vec_num1: each is larger than the previous element in the cummax() vector, so each is a new maximum. After the fourth element, all elements in vec_num1 are smaller than the maximum reached so far (4). In the cummax() vector, the maximum is now stable.

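cummin() is similar, but tracks the minimum. As the first element of vec_num1 is also its smallest value, the cumulative minimum never changes: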

cummin(vec_num1)
[1] 1 1 1 1 1 1 1
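
Because vec_num1 starts with its smallest value, the output of cummin() above is not very informative. The following sketch uses a made-up vector x_run in which both the running maximum and the running minimum change along the way.

```r
x_run <- c(3, 1, 4, 1, 5, 2)

# Running maximum: only changes when a new, larger value appears
run_max <- cummax(x_run)
run_max   # 3 3 4 4 5 5

# Running minimum: only changes when a new, smaller value appears
run_min <- cummin(x_run)
run_min   # 3 1 1 1 1 1
```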

The {purrr} package includes a function reduce() which is very useful with vectors. This function reduces the elements of a vector to a single value using a 2-argument function. The cumsum() and cumprod() functions’ last values equal the sum and product of all elements in the vector, but these functions also show all intermediate cumulative results. You can calculate only that final value using purrr::reduce(.x, .f, ..., .init, .dir = c("forward", "backward")). The first argument, .x, is a list or atomic vector. The second argument, .f, is the function that is applied across the elements. This function needs two arguments: with the default direction, the first is the accumulated value from the previous step and the second is the next element of the vector. The arguments .init and .dir set the initial value and the direction of the reduction, with “forward” being the default. If .init is not supplied, the reduction starts with the first element of .x. To calculate the total sum using this function:

purrr::reduce(vec_num1, .f = sum)
[1] 16

or even simpler:

purrr::reduce(vec_num1, `+`)
[1] 16

and the total product:

purrr::reduce(vec_num1, `*`)
[1] 144

Note that this function is not limited to + or *, but can be used with, e.g. /:

purrr::reduce(vec_num1, `/`)
[1] 0.006944444
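
The idea of passing an accumulated value from step to step can also be sketched without {purrr}. The example below uses base R’s Reduce() as a stand-in, so that it runs without extra packages; it is not the {purrr} interface itself, and the argument names acc and nxt are chosen only for this illustration.

```r
x <- c(1, 2, 3, 4, 3, 2, 1)

# Total sum with an explicit 2-argument function:
# `acc` is the accumulated value, `nxt` the next element of the vector
total <- Reduce(function(acc, nxt) acc + nxt, x)

# accumulate = TRUE keeps the intermediate values, like cumsum()
running <- Reduce(`+`, x, accumulate = TRUE)

# The direction matters for operators that are not commutative
left  <- Reduce(`-`, 1:4)                # ((1 - 2) - 3) - 4
right <- Reduce(`-`, 1:4, right = TRUE)  # 1 - (2 - (3 - 4))

total
running
c(left = left, right = right)
```

With subtraction, reducing from the left gives -8 while reducing from the right gives -2, which shows why the direction of the reduction is an argument in its own right.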

If you only need the total sum of all vector elements, you can use sum(x, na.rm = FALSE):

sum(vec_num1)
[1] 16

Likewise, the product of all elements in a vector can be computed using prod(x, na.rm = FALSE):

prod(vec_num1)
[1] 144
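
Both functions have an na.rm argument. A minimal sketch of its effect, using a made-up vector x_na:

```r
x_na <- c(2, NA, 3)

# With the default na.rm = FALSE, a single NA makes the result NA
s_default <- sum(x_na)

# With na.rm = TRUE, missing values are dropped first
s_clean <- sum(x_na, na.rm = TRUE)   # 2 + 3
p_clean <- prod(x_na, na.rm = TRUE)  # 2 * 3

c(s_default, s_clean, p_clean)
```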

Create a vector vec_1 as a sequence from 1 to 10:

vec_1 <- 1:10

Use this vector to

  • take the log, base 10:
log(vec_1, base = 10)
 [1] 0.0000000 0.3010300 0.4771213 0.6020600 0.6989700 0.7781513 0.8450980
 [8] 0.9030900 0.9542425 1.0000000
log10(vec_1)
 [1] 0.0000000 0.3010300 0.4771213 0.6020600 0.6989700 0.7781513 0.8450980
 [8] 0.9030900 0.9542425 1.0000000
  • multiply all elements with 2 and store the result in vec_2
vec_2 <- vec_1 * 2
  • subtract vec_1 from vec_2
vec_2 - vec_1
 [1]  1  2  3  4  5  6  7  8  9 10
  • calculate the cumulative sum and cumulative product of vec_1. Store the results in vec_1s and vec_1p:
vec_1s <- cumsum(vec_1)
vec_1p <- cumprod(vec_1)
  • using these results, show the total sum and total product (1 value each) of vec_1. To do so, assume that you don’t know the number of elements in this vector.
vec_1s[length(vec_1)]
[1] 55
vec_1p[length(vec_1)]
[1] 3628800

Calculate the total sum and total product of vec_1 in two other ways:

  • sum of vec_1
# Option 1

sum(vec_1)
[1] 55
# Option 2: 

purrr::reduce(vec_1, sum)
[1] 55
  • product of vec_1
# Option 1

prod(vec_1)
[1] 3628800
# Option 2: 

purrr::reduce(vec_1, `*`)
[1] 3628800
4.1.8.1.3 Statistical functions
4.1.8.1.3.1 Distributions

The “r”-variants of the distribution functions, such as rnorm(), were covered in a previous section. Here, we will (re-)introduce the other variants. Recall that we covered three. Applied to the normal distribution, these were pnorm(), dnorm() and qnorm(). We’ll use the vector vec_stat to illustrate these functions:

vec_stat <- c(-1.959964, -1.64448, -1.281552, 0, 1.281552, 1.64448, 1.959964)
  • pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) returns the probability that a value is smaller than or equal to q, by default for a standard normal distribution. Changing the default lower.tail = TRUE to FALSE returns the probability that a value is larger than q. For vec_stat, these values are equal to
pnorm(q = vec_stat, lower.tail = TRUE)
[1] 0.02500000 0.05003855 0.09999992 0.50000000 0.90000008 0.94996145 0.97500000
pnorm(q = vec_stat, lower.tail = FALSE)
[1] 0.97500000 0.94996145 0.90000008 0.50000000 0.09999992 0.05003855 0.02500000
  • dnorm(x, mean = 0, sd = 1, log = FALSE) returns the value of the probability density function at x:
dnorm(x = vec_stat)
[1] 0.05844507 0.10319904 0.17549823 0.39894228 0.17549823 0.10319904 0.05844507
  • qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) returns the quantile for each probability in p, i.e. the value for which the probability of finding a value smaller than or equal to it equals p. The default lower.tail = TRUE has the same interpretation as in pnorm(). Applied to c(0.025, 0.05, 0.10, 0.90, 0.95, 0.975):
qnorm(p = c(0.025, 0.05, 0.10, 0.90, 0.95, 0.975), lower.tail = TRUE)
[1] -1.959964 -1.644854 -1.281552  1.281552  1.644854  1.959964
qnorm(p = c(0.025, 0.05, 0.10, 0.90, 0.95, 0.975), lower.tail = FALSE)
[1]  1.959964  1.644854  1.281552 -1.281552 -1.644854 -1.959964
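
The relation between pnorm() and qnorm() can be checked with a short sketch: one function undoes the other. The vector q_val is made up for this illustration.

```r
q_val <- c(-1.96, 0, 1.96)

# pnorm() maps values to probabilities; qnorm() maps them back
p_val <- pnorm(q_val)
back  <- qnorm(p_val)
all.equal(back, q_val)

# The upper tail is the complement of the lower tail
upper <- pnorm(q_val, lower.tail = FALSE)
all.equal(upper, 1 - p_val)

# mean and sd rescale the distribution:
# P(X <= 15) with mean 5 and sd 10 equals P(Z <= 1) for a standard normal
all.equal(pnorm(15, mean = 5, sd = 10), pnorm(1))
```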

For other distributions, such as Student’s t, Chi-square, uniform or F, similar functions exist (e.g. pt(), qchisq(), dunif() or pf()).

In addition to these probability functions, there are many functions that summarize a vector. These include functions for central tendency and location (mean, median, …), and for the level of dispersion and skewness. To illustrate these functions, we’ll use

vec_norm <- rnorm(100, 5, 10)
4.1.8.1.3.2 Central tendency and location

Here we will focus on functions that you can use to summarise the data. mean(x, trim = 0, na.rm = FALSE) calculates the mean of a vector. The second argument, trim, can be used to remove a fraction of the observations at each end before computing the mean. For instance, trim = 0.10 would remove the smallest and largest 10% of all values and calculate the mean of the middle 80%. By default, this is 0. na.rm = FALSE tells R that it shouldn’t remove missing values. If the vector includes missing values and the default FALSE is left unchanged, the result of this function will be NA.

mean(x = vec_norm, na.rm = TRUE)
[1] 4.414747
mean(x = vec_norm, trim = 0.10, na.rm = TRUE)
[1] 4.443764
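
As vec_norm is random and has no missing values, the effect of trim and na.rm is easier to see on a small made-up vector with one outlier and one NA:

```r
x_small <- c(1, 2, 3, 4, 100, NA)

# Default: the NA propagates
m_default <- mean(x_small)

# Dropping the NA: (1 + 2 + 3 + 4 + 100) / 5
m_clean <- mean(x_small, na.rm = TRUE)

# Trimming 20% at each end of the 5 remaining values drops 1 and 100,
# so the mean is computed on 2, 3 and 4
m_trim <- mean(x_small, trim = 0.20, na.rm = TRUE)

c(m_default, m_clean, m_trim)
```

The trimmed mean (3) is much less sensitive to the outlier than the ordinary mean (22).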

The median(x, na.rm = FALSE) function calculates the median. Here again, you need to specify how to handle missing observations.

median(vec_norm, na.rm = TRUE)
[1] 3.787437

Note that the median is a special case of quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE). This function allows you to compute the quantiles of a distribution. By default, the function calculates the minimum, the 25th percentile, the median (50th percentile), the 75th percentile and the maximum. You can see this in probs = seq(0, 1, 0.25). Recall that seq(0, 1, 0.25) produces a vector (0, 0.25, 0.50, 0.75, 1). You can change this default by supplying your own probabilities, e.g. c(0.10, 0.25, 0.50, 0.75, 0.90). This option would show the 1st, 2nd and 3rd quartile (25th, 50th and 75th percentile) in addition to the 10th and 90th percentile. To see all deciles, you can use seq(0.10, 0.90, 0.10) as the value for probs. The last argument, names, tells R to add names to the values (e.g. 10%, 25%, 50%, …). If you set it to FALSE, these names are dropped. If you save the results in a new vector, you can subset them using the subsetting methods for both named and unnamed vectors. To see the 10th and 90th percentile as well as the 1st, 2nd and 3rd quartile of vec_stat:

vec_quan <- quantile(x = vec_stat, probs = c(0.10, 0.25, 0.50, 0.75, 0.90), na.rm = TRUE, names = TRUE)
vec_quan
      10%       25%       50%       75%       90% 
-1.770674 -1.463016  0.000000  1.463016  1.770674 

You can now subset vec_quan:

vec_quan[1]
      10% 
-1.770674 
vec_quan["75%"]
     75% 
1.463016 
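
The following sketch checks two statements made above on a made-up vector x_seq: the median is the 50% quantile, and seq(0.10, 0.90, 0.10) returns the deciles.

```r
x_seq <- 1:11

# The median equals the 50% quantile
med <- median(x_seq)
q50 <- quantile(x_seq, probs = 0.50, names = FALSE)
c(med, q50)

# All deciles, returned as a named vector
deciles <- quantile(x_seq, probs = seq(0.10, 0.90, 0.10))
names(deciles)

# Subsetting by name and by position gives the same value
d90_name <- deciles[["90%"]]
d90_pos  <- deciles[[9]]
c(d90_name, d90_pos)
```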

To see the minimum and maximum values, you can use min() and max(). In addition to the vector, these functions have an na.rm argument whose default FALSE you can change to TRUE:

min(vec_norm, na.rm = TRUE)
[1] -15.86817
max(vec_norm, na.rm = TRUE)
[1] 27.46498

The summary() function shows the mean and median as well as the minimum, maximum and the 1st and 3rd quartile. This function returns a table. If you save the results, you can subset this table using the traditional subsetting rules for named and unnamed vectors.

tab_sum <- summary(vec_norm)
tab_sum
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-15.868  -3.233   3.787   4.415  12.944  27.465 

If you use a name to subset, note that the names of some summary statistics include a ‘.’ at the end:

tab_sum["3rd Qu."]
 3rd Qu. 
12.94374 
4.1.8.1.3.3 Measures of dispersion

Often used measures of dispersion include the range (the minimum and maximum values), the interquartile range (the difference between the 3rd and 1st quartile), the variance and the standard deviation. To use the range() function, you need to supply it with a vector. The other argument is na.rm, which is FALSE by default. The function returns the minimum and maximum value. Note that these statistics are also included in e.g. summary(), min() and max() and that you can also select them using quantile().

range(vec_norm, na.rm = TRUE)
[1] -15.86817  27.46498

To calculate the interquartile range or IQR, you can use IQR(). The most important arguments of this function are the vector and na.rm:

IQR(x = vec_norm, na.rm = TRUE)
[1] 16.177

To compute the variance, you can use var(x, na.rm = FALSE). You can calculate the standard deviation either as the square root of the variance or using sd(x, na.rm = FALSE). In both functions, x is the vector whose variance or standard deviation you need to compute:

var(x = vec_norm, na.rm = TRUE)
[1] 104.2964
sqrt(var(x = vec_norm, na.rm = TRUE))
[1] 10.21256
sd(x = vec_norm, na.rm = TRUE)
[1] 10.21256
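
To see what these functions compute, the sketch below rebuilds them by hand for a small made-up vector x_disp. It relies on the fact that var() divides by n - 1 (the sample variance) and that IQR() uses the default quantile() method.

```r
x_disp <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Variance by hand: sum of squared deviations divided by n - 1
var_hand <- sum((x_disp - mean(x_disp))^2) / (length(x_disp) - 1)
all.equal(var_hand, var(x_disp))

# Standard deviation is the square root of the variance
all.equal(sqrt(var_hand), sd(x_disp))

# IQR is the difference between the 75th and 25th percentile
iqr_hand <- diff(quantile(x_disp, probs = c(0.25, 0.75), names = FALSE))
all.equal(iqr_hand, IQR(x_disp))

# Width of the range: maximum minus minimum
width <- diff(range(x_disp))
width
```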
4.1.8.1.3.4 Higher order moments: skewness and kurtosis

To calculate moments of an order higher than 2, you can use the {moments} package. This package includes functions such as skewness() and kurtosis(), which are based on the third and fourth moment of the distribution. For other moments, you can use moment(x, order = 1, central = FALSE, absolute = FALSE, na.rm = FALSE). The order argument allows you to set the order (e.g. 2 for the variance, 3 for the skewness, …). To calculate moments around the mean (e.g. like you would do to calculate the variance), set central = TRUE. To use this package, you have to install it first.
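
If you want to see what lies behind these statistics, central moments are easy to compute in base R. The sketch below is an illustration only: central_moment() is a hypothetical helper written for this example, and it assumes the common textbook definitions (moments with denominator n, skewness as m3 / m2^1.5, kurtosis as m4 / m2^2). The functions in {moments} may differ in detail.

```r
# Hypothetical helper: k-th central moment with denominator n
central_moment <- function(x, k) {
  mean((x - mean(x))^k)
}

x_mom <- c(1, 2, 3, 4, 10)

m2 <- central_moment(x_mom, 2)  # 10
m3 <- central_moment(x_mom, 3)  # 36
m4 <- central_moment(x_mom, 4)  # 278.8

# Standardised third and fourth moments (assumed definitions)
skew_hand <- m3 / m2^1.5
kurt_hand <- m4 / m2^2

c(skewness = skew_hand, kurtosis = kurt_hand)
```

The positive skewness reflects the long right tail created by the value 10.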

Create a vector vec_rn with 100 draws from a normal distribution with mean 5 and standard deviation 5, and a vector vec_rt with 100 draws from Student’s t-distribution with 10 degrees of freedom:

vec_rn <- rnorm(100, 5, 5)
vec_rt <- rt(100, 10)

Determine the probability that you find values larger than each of the elements in c(1.65, 1.75, 2.10) if these values follow a t-distribution with 5 degrees of freedom.

pt(c(1.65, 1.75, 2.10), df = 5, lower.tail = FALSE)
[1] 0.07992788 0.07026118 0.04487662

For a Chi-square distribution with 10 degrees of freedom, determine the values for which the probability of finding a value smaller than or equal to them equals each of the probabilities in c(0.025, 0.05, 0.10, 0.90, 0.95, 0.975):

qchisq(c(0.025, 0.05, 0.10, 0.90, 0.95, 0.975), df = 10)
[1]  3.246973  3.940299  4.865182 15.987179 18.307038 20.483177

Using vec_rn determine:

  • the mean (include the possibility that there are missing values):
mean(vec_rn, na.rm = TRUE)
[1] 5.310853
  • the mean if you remove the 10% lowest and 10% highest values:
mean(vec_rn, trim = 0.10, na.rm = TRUE)
[1] 5.312695
  • the median:
median(vec_rn, na.rm = TRUE)
[1] 5.104114
  • quantiles:
quantile(vec_rn, na.rm = TRUE)
       0%       25%       50%       75%      100% 
-5.807568  1.486231  5.104114  8.714391 15.770025 
  • minimum and maximum:
min(vec_rn)
[1] -5.807568
max(vec_rn)
[1] 15.77002
  • range:
range(vec_rn, na.rm = TRUE)
[1] -5.807568 15.770025
  • interquartile range:
IQR(vec_rn, na.rm = TRUE)
[1] 7.228161
  • standard deviation and variance:
sd(vec_rn, na.rm = TRUE)
[1] 5.049125
var(vec_rn, na.rm = TRUE)
[1] 25.49367
4.1.8.1.4 Rounding

Rounding numeric values uses round(), floor(), ceiling(), trunc() or signif(). Applied to a numeric vector, these functions output a vector with rounded data. To illustrate, let’s first take the natural logarithm of vec_num1 and use this vector to show how these functions work.

vec_num3 <- log(vec_num1)
  • round(x, digits = 0): rounds x to the number of decimal places set by digits. With digits = 2:
round(vec_num3, digits = 2)
[1] 0.00 0.69 1.10 1.39 1.10 0.69 0.00
  • floor(x): rounds down to the largest integer not greater than the value in x:
floor(vec_num3)
[1] 0 0 1 1 1 0 0
  • ceiling(x): rounds to the smallest integer not less than the value in x:
ceiling(vec_num3)
[1] 0 1 2 2 2 1 0
  • trunc(x): removes all decimal places:
trunc(vec_num3)
[1] 0 0 1 1 1 0 0
  • signif(x, digits = 6): rounds the values in x to the specified number of significant digits. Applied to c(123456, 654321, 147258, 852147) with digits = 4:
signif(x = c(123456, 654321, 147258, 852147), digits = 4)
[1] 123500 654300 147300 852100
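
For positive numbers, floor() and trunc() give the same result, as the output above shows. The difference only appears with negative values. A minimal sketch with a made-up vector x_rnd:

```r
x_rnd <- c(-1.5, -0.4, 0.4, 1.5, 2.5)

r_floor <- floor(x_rnd)    # always down:         -2 -1  0  1  2
r_ceil  <- ceiling(x_rnd)  # always up:           -1  0  1  2  3
r_trunc <- trunc(x_rnd)    # towards zero:        -1  0  0  1  2
r_round <- round(x_rnd)    # to nearest integer:  -2  0  0  2  2

rbind(r_floor, r_ceil, r_trunc, r_round)
```

Note that round() sends a value that lies exactly halfway between two integers to the even one, so both 1.5 and 2.5 are rounded to 2.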
4.1.8.1.5 Boolean operators

Boolean operators work element-wise. For instance, to check if the values in vec_num1 are larger than 50:

vec_num1 > 50
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Recall that we used this property to subset a vector. There are two other useful functions to apply to logical vectors: any() and all(). The first checks if at least one of the values is TRUE. The other checks if all values are TRUE. For instance, to check if any of the values in vec_num1 is larger than 450:

any(vec_num1 > 450, na.rm = TRUE)
[1] FALSE

You can use all() to check if a condition holds for all elements in a vector. For instance, to see if all elements are larger than 1, you can use

all(vec_num1 > 1, na.rm = TRUE)
[1] FALSE
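
The sketch below shows a few more cases, including the role of na.rm when the logical vector contains a missing value. The vectors x_val and lg are made up for this illustration.

```r
x_val <- c(1, 2, 3, 4, 3, 2, 1)

pos_all  <- all(x_val > 0)   # TRUE: every element is positive
gt1_all  <- all(x_val > 1)   # FALSE: the first element is not larger than 1
gt3_any  <- any(x_val > 3)   # TRUE: the fourth element is 4

# With a missing value, the answer can be unknown (NA)
lg <- c(TRUE, NA)
all_na    <- all(lg)                 # NA: the missing value could be FALSE
all_clean <- all(lg, na.rm = TRUE)   # TRUE
any_na    <- any(c(FALSE, NA))       # NA: the missing value could be TRUE
any_known <- any(c(TRUE, NA))        # TRUE: one TRUE is enough

c(pos_all, gt1_all, gt3_any, all_na, all_clean, any_na, any_known)
```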

4.1.9 Functions operating on character vectors

In Chapter 3, we introduced character functions. Here, we will see how they can be used with character vectors. Note that some {stringr} functions, e.g. str_split() or str_extract_all(), return a list. We will meet lists later in this chapter and see how you can use them in your analysis. Here, we will only use their properties if needed. Most base R functions output a vector. Sometimes, this makes them easier to use.

We already looked at paste() and paste0(). You can use these functions to generate a series of numbers as characters:

paste(1:5)
[1] "1" "2" "3" "4" "5"

If you apply these functions to a vector with several character values, you can change the default collapse = NULL to combine these character values into 1 string. To see what these functions do, let’s use:

vec_char1 <- c("dog", "cat", "fish")

Let’s first use paste()

paste(vec_char1, collapse = " __ ")
[1] "dog __ cat __ fish"

As you can see, the three elements of the character vector are now one character string, separated by “ __ ”. If you use paste0(), you get a similar result:

paste0(vec_char1, collapse = "**")
[1] "dog**cat**fish"

Both functions also allow you to create variable names such as “var_1”, “var_2”, …

paste("var", 1:5, sep = "_")
[1] "var_1" "var_2" "var_3" "var_4" "var_5"
paste0("var", 1:5)
[1] "var1" "var2" "var3" "var4" "var5"

Note that here you leave collapse at its default, as you want to keep the various character values as separate values in a character vector, for instance because you will use them as names in a dataset.
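
The arguments sep and collapse can also be combined: sep joins the pieces within each element, collapse then joins the elements into one string. A minimal sketch (the names vars, rhs and one_step are made up for this example):

```r
# sep joins "var" and the number within each element
vars <- paste("var", 1:3, sep = "_")

# collapse joins the three elements into a single string
rhs <- paste(vars, collapse = " + ")

# Both steps in one call
one_step <- paste("var", 1:3, sep = "_", collapse = " + ")

vars
rhs
identical(rhs, one_step)
```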

When we introduced string variables, we introduced regular expressions. Recall that you can include these regular expressions in {stringr}’s functions as well as in e.g. grep() or grepl(). These are especially useful if you have a character vector. Let’s define a character vector first. For instance, a vector of product codes could look like

vec_char3 <- c("25-78T", "25-98S", "45-97Q", "45-74Q", "45-72T", "55-10T", "55-48T", "55-69Q", "55+178T", 
               "173+8W", "235+9W", "125+1W", "274+2Q", "274+5Q", "751+9Q", "274+1W", "274+4W", "274+5Q", "751+6Q")

Suppose that you need to extract all product codes that start with 55 and end with T. As you can see in the character vector, there are three matches: “55-10T”, “55-48T” and “55+178T”. You don’t want to extract e.g. “55-69Q” or “25-78T”. This is where regular expressions are very useful. Recall that regular expressions allow you to identify patterns using e.g. “[A-Z][a-z]+”, “\\d” or “[0-9]{4}”. The first searches for a capital letter A to Z followed by one or more lowercase letters a to z. The second searches for a digit and the third for exactly four consecutive digits.

In the second, we use the backslash “\”. Recall that this is the escape operator for regular expressions. Using “d” in a regular expression would search for the occurrence of the letter “d”. To tell R it has to look for digits, we need to “escape” the usual meaning of “d”. To do so, we use “\”. As “\” also has a specific meaning in R strings, where it is the escape character as well, we need a second backslash. In other words, the first backslash in “\\d” tells R to treat the second backslash as a literal backslash and not as the start of a string escape. The regular expression that is passed on is then “\d”, in which the backslash changes the interpretation of “d”: we don’t want the literal “d” but the “d” that refers to all digits. The escape also works in the other direction: a dot “.” in a regular expression matches any character. To search for a literal dot, you need to use “\\.”. Doing so, R will not match any character but will look for a “.”.
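
The effect of the escape can be checked with grepl(), which returns TRUE or FALSE for each element. The vector x_esc is made up for this illustration.

```r
x_esc <- c("a.b", "a1b", "ab", "3d")

lit_d   <- grepl("d", x_esc)     # the letter d:    only "3d"
digit   <- grepl("\\d", x_esc)   # any digit:       "a1b" and "3d"
any_chr <- grepl(".", x_esc)     # any character:   every element
lit_dot <- grepl("\\.", x_esc)   # a literal dot:   only "a.b"

rbind(lit_d, digit, any_chr, lit_dot)

# The R string "\\d" holds only two characters: a backslash and a d
nchar("\\d")
```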

If you want to avoid backslashes for special characters such as ., *, {, }, +, $, |, ?, (, ), you can include them between square brackets []. For instance [.] or [?] will look for a literal “.” or “?”. Note that ^ keeps a special meaning when it is the first character within the square brackets.
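
As an illustration, here are three ways to look for a literal “+” in a few of the product codes. The third one, fixed = TRUE, is an argument of grepl() and grep() that switches regular expressions off altogether; it is mentioned here as an alternative, not as part of the approach described above.

```r
codes <- c("55-10T", "55+178T", "173+8W")

with_brackets <- grepl("[+]", codes)              # + inside square brackets
with_escape   <- grepl("\\+", codes)              # + escaped with backslashes
with_fixed    <- grepl("+", codes, fixed = TRUE)  # pattern taken literally

rbind(with_brackets, with_escape, with_fixed)
```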

To search for a specific word or part of a word, you can use that word or its part. Suppose that you have a character vector

vec_char1 <- c("pineapple", "strawberry", "blackberry", "apple", "banana", "grapefruit", "melon", "cranberry", "kiwi", "lemon")

To extract all elements including “berry”, you can use e.g. grep(). Recall that this function’s first argument is the pattern and the second the character vector where grep() will look for pattern matches. In addition, you can change ignore.case = FALSE to TRUE to ignore the case, and change the default value = FALSE to TRUE to show the matching values as opposed to their positions. To show the pattern match, you’ll also see the output from {stringr}’s str_view(string, pattern, match = NA) function. This function shows the matches in the character vector. With match = NA all elements are shown; changing it to match = TRUE limits the output to the matched strings. Adding html = TRUE shows an html widget in the Viewer tab of the files pane.

grep(pattern = "berry", x = vec_char1, value = TRUE)
[1] "strawberry" "blackberry" "cranberry" 
stringr::str_view(vec_char1, "berry", match = NA )
 [1] │ pineapple
 [2] │ straw<berry>
 [3] │ black<berry>
 [4] │ apple
 [5] │ banana
 [6] │ grapefruit
 [7] │ melon
 [8] │ cran<berry>
 [9] │ kiwi
[10] │ lemon

To see the position of these values:

grep(pattern = "berry", x = vec_char1, value = FALSE)
[1] 2 3 8
stringr::str_view(vec_char1, "berry", match = NA )
 [1] │ pineapple
 [2] │ straw<berry>
 [3] │ black<berry>
 [4] │ apple
 [5] │ banana
 [6] │ grapefruit
 [7] │ melon
 [8] │ cran<berry>
 [9] │ kiwi
[10] │ lemon

Including the pattern “an” shows all elements that include the letters “an”:

grep(pattern = "an", x = vec_char1, value = TRUE)
[1] "banana"    "cranberry"
stringr::str_view(vec_char1, "an", match = NA)
 [1] │ pineapple
 [2] │ strawberry
 [3] │ blackberry
 [4] │ apple
 [5] │ b<an><an>a
 [6] │ grapefruit
 [7] │ melon
 [8] │ cr<an>berry
 [9] │ kiwi
[10] │ lemon

Using “|” you can combine various patterns. For instance, “an|be” matches all elements in the vector that include “an” or “be”.

grep(pattern = "an|be", x = vec_char1, value = TRUE)
[1] "strawberry" "blackberry" "banana"     "cranberry" 
stringr::str_view(vec_char1, "an|be", match = NA)
 [1] │ pineapple
 [2] │ straw<be>rry
 [3] │ black<be>rry
 [4] │ apple
 [5] │ b<an><an>a
 [6] │ grapefruit
 [7] │ melon
 [8] │ cr<an><be>rry
 [9] │ kiwi
[10] │ lemon

In these examples, the pattern is fixed, e.g. “berry”. However, often the patterns are less clear. Regular expressions allow you to build more complex patterns.

First, you can combine one or more character sets (letters, numbers, symbols) with quantifiers that control the number of occurrences and anchors that determine where a pattern occurs. Let’s start with the first. A character set is included between []. For instance, [rst] matches lowercase “r”, “s” or “t”; [aeiou] matches the lowercase vowels “a”, “e”, “i”, “o” or “u”; [IJK] matches uppercase “I”, “J” or “K”; and [aBcD] matches lowercase “a” or “c” or uppercase “B” or “D”. In a similar way you can match numbers. For instance [0123] matches “0”, “1”, “2” or “3”. You can include letters and numbers in your set: [a1B2c] matches “a”, “1”, “B”, “2” or “c”. Note that you can combine character sets in one regular expression. For instance [aeiou]b[aeiou] searches for a pattern: a vowel, the letter b and another vowel. As an example, let’s search for the pattern: any letter from “m”, “n”, “o”, “p”, “q” or “r” followed by an “a” followed by any letter from “m”, “n”, “o”, “p”, “q” or “r”:

grep("[mnopqr]a[mnopqr]", vec_char1, value = TRUE)
[1] "banana"     "grapefruit" "cranberry" 
stringr::str_view(vec_char1, "[mnopqr]a[mnopqr]", match = NA)
 [1] │ pineapple
 [2] │ strawberry
 [3] │ blackberry
 [4] │ apple
 [5] │ ba<nan>a
 [6] │ g<rap>efruit
 [7] │ melon
 [8] │ c<ran>berry
 [9] │ kiwi
[10] │ lemon

Special characters can also be part of a set. For instance [a-z\.] matches “a” to “z” as well as “.”, [$] matches “$” and [{}] matches “{” or “}”. As described above, most special characters lose their special meaning within square brackets, so the escape character in the first example is optional.

If you use - within your character set, you can define a range: [a-z] matches all lowercase letters from a to z. If you change the “a” or “z” you can restrict the range, e.g. [k-n] matches “k”, “l”, “m” or “n”. Using uppercase allows you to define a range of uppercase letters: [B-E] matches “B”, “C”, “D” or “E”. Combining both, e.g. [a-zA-Z] or [A-Za-z], matches any letter “a” to “z” or “A” to “Z”. Using numbers, [0-9] matches all digits from 0 to 9 while [1-3] matches the digits 1 to 3. To see how these ranges work, let’s look for the pattern: any letter from “a” to “m” followed by an “e” followed by any letter from “n” to “z” in vec_char1:

grep("[a-m]e[n-z]", vec_char1, value = TRUE)
[1] "strawberry" "blackberry" "cranberry" 
stringr::str_view(vec_char1, "[a-m]e[n-z]", match = NA)
 [1] │ pineapple
 [2] │ straw<ber>ry
 [3] │ black<ber>ry
 [4] │ apple
 [5] │ banana
 [6] │ grapefruit
 [7] │ melon
 [8] │ cran<ber>ry
 [9] │ kiwi
[10] │ lemon

Including the caret sign “^” at the start of a character set works as a negation. For instance [^a-k] matches any character except the lowercase letters “a” to “k”, and [^qrt] matches any character except “q”, “r” or “t”. Used with digits, [^3-9] matches any character except “3” to “9”. To see how this works, let’s use the caret sign in the previous regular expression: “[^a-m]e[^n-z]” matches a character not from “a” to “m” followed by an “e” followed by a character not from “n” to “z”:

grep("[^a-m]e[^n-z]", vec_char1, value = TRUE)
[1] "pineapple"  "grapefruit"
stringr::str_view(vec_char1, "[^a-m]e[^n-z]", match = NA)
 [1] │ pi<nea>pple
 [2] │ strawberry
 [3] │ blackberry
 [4] │ apple
 [5] │ banana
 [6] │ gra<pef>ruit
 [7] │ melon
 [8] │ cranberry
 [9] │ kiwi
[10] │ lemon
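
A negated set matches any character that is not listed, which includes uppercase letters and digits. The following sketch, with a made-up vector x_neg, illustrates this:

```r
x_neg <- c("abc", "ABC", "xyz", "123")

# TRUE if the string contains at least one lowercase letter a to k
in_set  <- grepl("[a-k]", x_neg)

# TRUE if the string contains at least one character that is NOT a to k
not_set <- grepl("[^a-k]", x_neg)

rbind(in_set, not_set)
```

Only “abc” consists entirely of letters from “a” to “k”, so it is the only element without a match for the negated set.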

In addition to these character sets, there are metacharacters and shortcuts that have their own meaning. With respect to the metacharacters, you can use “.” to refer to any single character. In other words, “a..b” matches all patterns that start with a, end with b and have two characters between them. For instance, if you want to find matches for “a..e” in vec_char1, R will look for all occurrences of “a” followed by 2 other characters and ending with “e”:

grep("a..e", vec_char1, value = TRUE)
[1] "strawberry" "cranberry" 
stringr::str_view(vec_char1, "a..e", match = NA)
 [1] │ pineapple
 [2] │ str<awbe>rry
 [3] │ blackberry
 [4] │ apple
 [5] │ banana
 [6] │ grapefruit
 [7] │ melon
 [8] │ cr<anbe>rry
 [9] │ kiwi
[10] │ lemon

With respect to the shortcuts, they include:

  • \d : any digit character, 0, 1, 2 …
  • \D : any non digit character: letters, question marks, spaces, …
  • \w : any alphanumeric character (letters, digits and the underscore)
  • \W : any non-alphanumeric character (symbols, punctuation, whitespace, …)
  • \s : any whitespace character, including space and tab
  • \S : any non-whitespace character

Note that in an R string you need to double the backslash (e.g. "\\d"). Using these shortcuts, you can replace [0-9] with \d, search for whitespace using \s, … . As an example:

  • matching any digit:
grep("\\d", c("125", "abc! ", " "), value = TRUE)
[1] "125"
grep("[0-9]", c("125", "abc! ", " "), value = TRUE)
[1] "125"
stringr::str_view(c("125", "abc! ", " "), "\\d", match = NA)
[1] │ <1><2><5>
[2] │ abc! 
[3] │  
  • matching any non-digit:
grep("\\D", c("125", "abc! ",  " "), value = TRUE)
[1] "abc! " " "    
stringr::str_view(c("125", "abc! ", " "), "\\D", match = NA)
[1] │ 125
[2] │ <a><b><c><!>< >
[3] │ < >
  • matching any alphanumeric character:
grep("\\w", c("125", "abc! ", " "), value = TRUE)
[1] "125"   "abc! "
stringr::str_view(c("125", "abc! ", " "), "\\w", match = NA)
[1] │ <1><2><5>
[2] │ <a><b><c>! 
[3] │  
  • matching any non-alphanumeric character:
grep("\\W", c("125", "abc! ", " "), value = TRUE)
[1] "abc! " " "    
stringr::str_view(c("125", "abc! ", " "), "\\W", match = NA)
[1] │ 125
[2] │ abc<!>< >
[3] │ < >
  • matching a whitespace:
grep("\\s", c("125", "abc! ",  " "), value = TRUE)
[1] "abc! " " "    
stringr::str_view(c("125", "abc! ", " "), "\\s", match = NA)
[1] │ 125
[2] │ abc!< >
[3] │ < >
  • matching any non-whitespace:
grep("\\S", c("125", "abc! ",  " "), value = TRUE)
[1] "125"   "abc! "
stringr::str_view(c("125", "abc! ", " "), "\\S", match = NA)
[1] │ <1><2><5>
[2] │ <a><b><c><!> 
[3] │  

Anchors determine the location of a pattern. Using the caret ^, the pattern needs to be located at the start of the string. In other words, ^r will match strings starting with an “r”. Note the difference in result if ^ is used within [ ] and at the start of a pattern. Within square brackets, it excludes the letters or numbers within the square brackets. At the start of a pattern, it determines the position of what follows. Ending a pattern with a $ means that the pattern should be at the end of a string. In other words, y$ matches a y at the end of the string. Using \b, you locate a pattern at the start or end of a word (e.g. next to a space, dash, comma, semicolon, dot, …) while \B matches any position that is not a word boundary:

  • ^ : the string starts with the expression following ^
  • $ : the string ends with the expression before $
  • \b : matches a word boundary, i.e. the position between a word character and a non-word character (space, dash, comma, semicolon, …) or the start or end of the string
  • \B : matches a non-word boundary (the position between \w and \w or between \W and \W)

For example:

  • matching a pattern at the start of a string:
grep("^b", vec_char1, value = TRUE)
[1] "blackberry" "banana"    
stringr::str_view(vec_char1, "^b", match = NA)
 [1] │ pineapple
 [2] │ strawberry
 [3] │ <b>lackberry
 [4] │ apple
 [5] │ <b>anana
 [6] │ grapefruit
 [7] │ melon
 [8] │ cranberry
 [9] │ kiwi
[10] │ lemon
  • matching a pattern at the end of a string:
grep("y$", vec_char1, value = TRUE)
[1] "strawberry" "blackberry" "cranberry" 
stringr::str_view(vec_char1, "y$", match = NA)
 [1] │ pineapple
 [2] │ strawberr<y>
 [3] │ blackberr<y>
 [4] │ apple
 [5] │ banana
 [6] │ grapefruit
 [7] │ melon
 [8] │ cranberr<y>
 [9] │ kiwi
[10] │ lemon
  • matching a pattern at a word boundary (here the end of a word):
grep("e\\b", c("average costs", "total sales", "total revenues"), value = TRUE)
[1] "average costs"
stringr::str_view(c("average costs", "total sales", "total revenues"), "e\\b", match = NA)
[1] │ averag<e> costs
[2] │ total sales
[3] │ total revenues
  • matching a pattern at a non-word boundary:
grep("s\\B", c("average cost", "total sales", "total revenues"), value = TRUE)
[1] "average cost" "total sales" 
stringr::str_view(c("average cost", "total sales", "total revenues"), "s\\B", match = NA)
[1] │ average co<s>t
[2] │ total <s>ales
[3] │ total revenues

Using word boundaries, you can match occurrences of e.g. individual numbers not included in another one. For instance, to identify the number “2” as a number not included in e.g. “125” or “210” you can use

grep("\\b2\\b", c("125", "2", "210"), value = TRUE)
[1] "2"

Quantifiers control the number of occurrences of a pattern. A quantifier always applies to the element directly before it. Using + after a pattern means that this pattern occurs one or more times. For instance “[a-z]+” means that a letter from “a” to “z” can occur once but also multiple times in a row. * is used when a pattern doesn’t have to occur but could also occur once or multiple times. With ? the pattern occurs at most once. In other words, the pattern before ? is optional. Using {x} fixes the number of repetitions to x while {x,y} sets the number of repetitions between x and y.

  • + : one or more repetitions
  • * : zero or more repetitions (i.e. the pattern may be absent but can also occur multiple times)
  • ? : at most 1 repetition
  • {x} : exactly x repetitions
  • {x,} : at least x repetitions
  • {x,y} : between x and y repetitions

Here are a couple of examples:

  • matching one or more repetitions:
grep("p+", vec_char1, value = TRUE)
[1] "pineapple"  "apple"      "grapefruit"
stringr::str_view(vec_char1, "p+", match = NA)
 [1] │ <p>inea<pp>le
 [2] │ strawberry
 [3] │ blackberry
 [4] │ a<pp>le
 [5] │ banana
 [6] │ gra<p>efruit
 [7] │ melon
 [8] │ cranberry
 [9] │ kiwi
[10] │ lemon
  • matching an optional character (here the Q is optional)
grep("abcQ?abc", c("abcQabc", "abcabc", "abc_abc"), value = TRUE)
[1] "abcQabc" "abcabc" 
stringr::str_view(c("abcQabc", "abcabc", "abc_abc"), "abcQ?abc", match = NA)
[1] │ <abcQabc>
[2] │ <abcabc>
[3] │ abc_abc
  • matching exactly two repetitions
grep("p{2}", vec_char1, value = TRUE)
[1] "pineapple" "apple"    
stringr::str_view(vec_char1, "p{2}", match = NA)
 [1] │ pinea<pp>le
 [2] │ strawberry
 [3] │ blackberry
 [4] │ a<pp>le
 [5] │ banana
 [6] │ grapefruit
 [7] │ melon
 [8] │ cranberry
 [9] │ kiwi
[10] │ lemon

If you have longer character vectors that include various lines, you can identify every new line or a tab using:

  • \n : A new line
  • \t : A tab
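As a short illustration of these two escapes, the sketch below uses a made-up vector txt: one element contains a new line, one contains a tab and one contains neither.

```r
# Hypothetical strings: "\n" is a new line, "\t" is a tab
txt <- c("first line\nsecond line", "name\tvalue", "plain text")

has_newline <- grepl("\n", txt)   # TRUE  FALSE FALSE
has_tab     <- grepl("\t", txt)   # FALSE TRUE  FALSE

# Split the first element at the new line
lines_1 <- strsplit(txt[1], "\n")[[1]]
lines_1
```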

An expression between parentheses () forms a group. This allows you e.g. to apply a quantifier to that group. For instance, let’s use the pattern “(na)+” to find matches in vec_char1:

grep("(na)+", vec_char1, value = TRUE)
[1] "banana"
stringr::str_view(vec_char1, "(na)+", match = NA)
 [1] │ pineapple
 [2] │ strawberry
 [3] │ blackberry
 [4] │ apple
 [5] │ ba<nana>
 [6] │ grapefruit
 [7] │ melon
 [8] │ cranberry
 [9] │ kiwi
[10] │ lemon

As you can see, there is one match: in banana. Groups are useful because they allow you to shorten some regular expressions. For instance, suppose that you are looking for the pattern “3 letters, a number, 3 letters, a number”, e.g. abc1csb2. There are two ways to write this regular expression. The first is “[a-z]{3}\d[a-z]{3}\d”. The second, using parentheses, is “([a-z]{3}\d){2}”. Using parentheses, you repeat the part within the group twice. In an R string, \d is written as \\d.
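The following sketch checks that both ways of writing the pattern give the same result. The vector codes is made up for this illustration: only its first element follows the “3 letters, a number, 3 letters, a number” pattern.

```r
# Hypothetical codes: only the first one has the required structure
codes <- c("abc1csb2", "abc1csb", "ab1cd2ef")

pat_long  <- "[a-z]{3}\\d[a-z]{3}\\d"  # pattern written out twice
pat_short <- "([a-z]{3}\\d){2}"        # group repeated twice

match_long  <- grepl(pat_long, codes)
match_short <- grepl(pat_short, codes)

match_long    # TRUE FALSE FALSE
match_short   # TRUE FALSE FALSE
```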

Note that you can store regular expressions in an object. For instance,

pat_1 <- "abc|def"

stores a regular expression you can re-use:

grep(pat_1, c("abc", "def", "ghi"), value = TRUE)
[1] "abc" "def"

This allows you to generate patterns from code.
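As a minimal sketch of generating a pattern from code, you can paste the elements of a character vector together with | as separator. The vector currencies and the test strings are made up for this example.

```r
# Hypothetical vector of alternatives to search for
currencies <- c("usd", "eur", "gbp")

# Build the pattern "usd|eur|gbp" from the vector
pat_cur <- paste(currencies, collapse = "|")
pat_cur

found <- grep(pat_cur, c("usd 25", "eur 36", "chf 12"), value = TRUE)
found   # "usd 25" "eur 36"
```

If the vector with alternatives changes, the pattern changes with it and you don’t have to edit the regular expression by hand.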

Let’s now use these regular expressions to extract characters from a character vector. Let’s first start with

vec_char2 <- c("usd 25", "eur 35", "USD 36", "EUR 88", "Usd 4700", "Eur 18723", "$25522", "€140")

Here, you can see that all strings in the character vector refer to a currency, usd or eur, but that these references are written in multiple ways. To work with the numbers, we need to extract the currency and the amount and store each in a separate variable. Let’s stick to regular expressions (you could also use e.g. tolower() to change uppercase currencies to lowercase and gsub() to replace all occurrences of $ and € with “usd” or “eur”). Use {stringr}’s str_extract_all() to extract the currencies:

stringr::str_extract_all(vec_char2, "[A-Za-z]{3}|€|\\$")
[[1]]
[1] "usd"

[[2]]
[1] "eur"

[[3]]
[1] "USD"

[[4]]
[1] "EUR"

[[5]]
[1] "Usd"

[[6]]
[1] "Eur"

[[7]]
[1] "$"

[[8]]
[1] "€"
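For comparison, here is a sketch of the base R alternative mentioned earlier, using tolower() and gsub(). It is one possible way to harmonise the currencies, not the approach followed in the rest of this section. The euro sign is written as "\u20ac" to avoid encoding problems, and the object names cur, cur_code and cur_amount are chosen for this example.

```r
vec_char2 <- c("usd 25", "eur 35", "USD 36", "EUR 88", "Usd 4700",
               "Eur 18723", "$25522", "\u20ac140")

cur <- tolower(vec_char2)           # "USD" and "Usd" become "usd"
cur <- gsub("\\$", "usd ", cur)     # replace the dollar sign
cur <- gsub("\u20ac", "eur ", cur)  # replace the euro sign

cur_code   <- substr(cur, 1, 3)                   # first three characters
cur_amount <- as.numeric(gsub("[^0-9]", "", cur)) # keep digits only
cur_code
cur_amount
```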

Now use the same function to extract the numbers.

stringr::str_extract_all(vec_char2, "\\d+")
[[1]]
[1] "25"

[[2]]
[1] "35"

[[3]]
[1] "36"

[[4]]
[1] "88"

[[5]]
[1] "4700"

[[6]]
[1] "18723"

[[7]]
[1] "25522"

[[8]]
[1] "140"

str_extract_all() returns a list. You can access the elements of that list using the subsetting operators for a list. For instance, to show the value for the second outcome and return a numeric value, you would use:

outcome <- stringr::str_extract_all(vec_char2, "\\d+")
as.numeric(outcome[[2]][1])
[1] 35

As an alternative, you can simplify these results. To do so, you add simplify = TRUE as an argument to the str_extract_all() function, which then returns a character matrix instead of a list:

outcomes <- stringr::str_extract_all(vec_char2, "\\d+", simplify = TRUE)
outcomes |> as.numeric()
[1]    25    35    36    88  4700 18723 25522   140

You can now subset these results using the usual subsetting operators.

Suppose that student numbers are written as “r2024-000125-B”. Here the pattern is “lowercase r, followed by the academic year, followed by -, followed by 6 digits, followed by -, and ending with any uppercase letter”. Write a regular expression that identifies these numbers in

char_stud <- c("r2024-000125-B", "r2024-005524-L", "r2024-00014-5", "r2024-1000140-C")

Note that only the first two are correct.

grep("r\\d{4}-\\d{6}-[A-Z]", char_stud, value = TRUE)
[1] "r2024-000125-B" "r2024-005524-L"

How would you change this regular expression if the part in the middle could be 6 or 7 digits? If that is the case, in addition to the first two, the last number should also match.

grep("r\\d{4}-\\d{6,7}-[A-Z]", char_stud, value = TRUE)
[1] "r2024-000125-B"  "r2024-005524-L"  "r2024-1000140-C"

Recall that dates are written as “yyyy-mm-dd”. Write a regular expression that matches the actual dates in the following character vector.

vec_dat <- c("2025-03-20", "2025-03-08", "1998-11-11", "2025-24-33", "2025-19-54")

Note that only the first 3 are correct.

grep("\\d{4}-[0-1][0-9]-[0-3][0-9]", vec_dat, value = TRUE)
[1] "2025-03-20" "2025-03-08" "1998-11-11"
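Note that the pattern used here still accepts months from 00 to 19 and days from 00 to 39. A stricter sketch, which is one possible refinement and not part of the original exercise, limits the month to 01-12 and the day to 01-31. The test vector dates_test is made up for this illustration.

```r
# Hypothetical test values, including impossible months and days
dates_test <- c("2025-03-20", "2025-19-30", "2025-00-10", "2025-12-31", "2025-02-30")

# Month: 01-09 or 10-12; day: 01-09, 10-29 or 30-31
pat_date <- "^\\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$"

valid <- grepl(pat_date, dates_test)
valid   # TRUE FALSE FALSE TRUE TRUE
```

The last element, February 30, still matches: a regular expression checks the format, not the calendar. To check whether a date really exists, you would need to convert the string to a date.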

Here you have some sentences. Using {stringr}’s str_count(), count the number of times the letters “the” occur in words, excluding the word “the” itself (e.g. thesis, these, they).

vec_quote <- c("The thesis was written by 2 students.", 
              "These students were in the same group for mathematics.",
              "The first part of their work included their theory.",
              "They had to apply statistics to test their hypothesis.",
              "These tests were done in R.",
              "To collect their data, they had to visit a theater.")

First write the regular expression to match these words:

stringr::str_view(vec_quote, "(([A-Za-z]?)+(T|t)he)[a-z]+", match = NA)
[1] │ The <thesis> was written by 2 students.
[2] │ <These> students were in the same group for <mathematics>.
[3] │ The first part of <their> work included <their> <theory>.
[4] │ <They> had to apply statistics to test <their> <hypothesis>.
[5] │ <These> tests were done in R.
[6] │ To collect <their> data, <they> had to visit a <theater>.

Use str_count() to count the number of matches per sentence:

stringr::str_count(vec_quote, "(([A-Za-z]?)+(T|t)he)[a-z]+")
[1] 1 2 3 3 1 3
# Check the words: "the" is included in thesis, these, mathematics, their,
# theory, hypothesis, they and theater. The T can be uppercase or lowercase.
# ((T|t)he): a group of letters allowing for "The" as well as "the"
# part before this group: ([A-Za-z]?)+: an optional series of letters,
# uppercase or lowercase: [A-Za-z], optional: ?, repeated as a group: +
# part after ((T|t)he): any series of lowercase letters: [a-z]+

Here, str_count() returns the result as a vector with one count per sentence.

Let’s return to vec_char3

vec_char3
 [1] "25-78T"  "25-98S"  "45-97Q"  "45-74Q"  "45-72T"  "55-10T"  "55-48T" 
 [8] "55-69Q"  "55+178T" "173+8W"  "235+9W"  "125+1W"  "274+2Q"  "274+5Q" 
[15] "751+9Q"  "274+1W"  "274+4W"  "274+5Q"  "751+6Q" 

and try to write a regular expression that matches all product codes that start with 55 and end with T. To define the start, we can use the caret sign: “^55”. As you can see from the product codes, “55” is followed by other characters. Sometimes there are 3, e.g. “-10”, on other occasions there are 4, e.g. “+178”. To allow for this repetition of any character, we will use “.” (any character) and allow for one or more repetitions: “.+”. The last part includes the “T”. In a regular expression, that is “T$”. With this regular expression, we can now extract the product codes:

grep(pattern = "^55.+T$", vec_char3, value = TRUE)
[1] "55-10T"  "55-48T"  "55+178T"
stringr::str_view(vec_char3, "^55.+T$", match = NA)
 [1] │ 25-78T
 [2] │ 25-98S
 [3] │ 45-97Q
 [4] │ 45-74Q
 [5] │ 45-72T
 [6] │ <55-10T>
 [7] │ <55-48T>
 [8] │ 55-69Q
 [9] │ <55+178T>
[10] │ 173+8W
[11] │ 235+9W
[12] │ 125+1W
[13] │ 274+2Q
[14] │ 274+5Q
[15] │ 751+9Q
[16] │ 274+1W
[17] │ 274+4W
[18] │ 274+5Q
[19] │ 751+6Q

Using the following vector, you will have to write regular expressions to extract elements of that vector. You can use stringr::str_view(vec, pattern) to see if your regular expression is successful in matching the required outcome. In the folded code, this function is included to show the pattern matches. The folded code also assigns the patterns to pat to use in the function calls.

vec <- c("+32 123 456789", "0032 123 456798", "+32 012345679", "rqx_47-87+5", "rqx_47-87+6", "rqx_47-86+5", "rpts_47-86+5", "usd 25", "eur 36")
  • using grep(), extract all elements that contain a cell phone number. Such a number starts with +32 or 0032 and is followed by a space, 3 digits, a space and 6 digits. Some people forget to include that second space and add 9 digits after 32 and the first space. In vec, the first three elements are correct numbers, the others aren’t.
Code
pat <- ".+32\\s[0-9]{3}\\s?[0-9]{6}"
stringr::str_view(vec, pat, match = NA)
[1] │ <+32 123 456789>
[2] │ <0032 123 456798>
[3] │ <+32 012345679>
[4] │ rqx_47-87+5
[5] │ rqx_47-87+6
[6] │ rqx_47-86+5
[7] │ rpts_47-86+5
[8] │ usd 25
[9] │ eur 36
Code
grep(pat, vec, value = TRUE)
[1] "+32 123 456789"  "0032 123 456798" "+32 012345679"  
  • Use this pattern to extract these values from vec and store the result in vec_phone:
Code
vec_phone <- vec[grepl(pat, vec)]
vec_phone
[1] "+32 123 456789"  "0032 123 456798" "+32 012345679"  
  • extract all values from vec that include a currency. Write your code in such a way that “yen”, “gbp” and “sek” would also be extracted if included.
Code
pat <- "usd|eur|yen|gbp|sek"
stringr::str_view(vec, pat, match = NA)
[1] │ +32 123 456789
[2] │ 0032 123 456798
[3] │ +32 012345679
[4] │ rqx_47-87+5
[5] │ rqx_47-87+6
[6] │ rqx_47-86+5
[7] │ rpts_47-86+5
[8] │ <usd> 25
[9] │ <eur> 36
Code
grep(pat, vec, value = TRUE)
[1] "usd 25" "eur 36"
  • use this outcome to split the currency from the value. Start from the previous result, use the pipe operator to pass it to a {stringr} function and simplify the results:
Code
grep(pat, vec, value = TRUE) |> stringr::str_split(pattern = " ", simplify = TRUE)
     [,1]  [,2]
[1,] "usd" "25"
[2,] "eur" "36"

Here, you have a matrix. You can extract the values using matrix subsetting operators. These will be introduced later in this chapter.
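A remark on the phone number pattern used in the exercise above: it starts with “.+32”, which accepts any characters before “32”. The sketch below compares it with a stricter pattern that anchors the start and the end of the string and only allows “+32” or “0032” as a prefix. The stricter pattern is a suggestion, not part of the original solution, and the last element of vec_test is made up to show the difference.

```r
# Hypothetical test vector: the last element is not a valid number
vec_test <- c("+32 123 456789", "0032 123 456798", "+32 012345679",
              "call 132 123 456789")

pat_loose  <- ".+32\\s[0-9]{3}\\s?[0-9]{6}"          # pattern from the exercise
pat_strict <- "^(\\+|00)32\\s[0-9]{3}\\s?[0-9]{6}$"  # anchored alternative

loose  <- grepl(pat_loose, vec_test)
strict <- grepl(pat_strict, vec_test)

loose    # TRUE TRUE TRUE TRUE
strict   # TRUE TRUE TRUE FALSE
```

On the exercise vector vec, both patterns select the same three elements.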

  • use vec and extract all values that include “47” after the initial letters and end with “+5”. Use str_extract_all() and simplify the results. Write your code using the pipe operator:
Code
pat <- "([a-z]+)?_47.+\\+5"
stringr::str_view(vec, pat, match = NA)
[1] │ +32 123 456789
[2] │ 0032 123 456798
[3] │ +32 012345679
[4] │ <rqx_47-87+5>
[5] │ rqx_47-87+6
[6] │ <rqx_47-86+5>
[7] │ <rpts_47-86+5>
[8] │ usd 25
[9] │ eur 36
Code
vec |> stringr::str_extract_all(pat, simplify = TRUE)
      [,1]          
 [1,] ""            
 [2,] ""            
 [3,] ""            
 [4,] "rqx_47-87+5" 
 [5,] ""            
 [6,] "rqx_47-86+5" 
 [7,] "rpts_47-86+5"
 [8,] ""            
 [9,] ""            

Use the following paragraph from a Reuters article:

reuters <- "The pound headed for its worst weekly performance against the euro in over two years on Friday, as a boost to European spending drove a broad rally in the single currency, while against the dollar, sterling rose ahead of U.S. jobs data. The euro has surged across the board this week, logging its best weekly performance against the dollar since March 2009. Against the pound, it was set for a weekly gain of 1.5%, the most since January 2023. It was last up 0.4% at 84.03 pence. The pound was up 0.4% against the dollar at $1.292."
  • verify if this article includes references to “pound” or “sterling” (note that both could be with and without uppercase “P” or “S”). Use str_detect() to do so.
Code
pat <- "pound|Pound|sterling|Sterling"
stringr::str_view(reuters, pat, match = NA)
[1] │ The <pound> headed for its worst weekly performance against the euro in over two years on Friday, as a boost to European spending drove a broad rally in the single currency, while against the dollar, <sterling> rose ahead of U.S. jobs data. The euro has surged across the board this week, logging its best weekly performance against the dollar since March 2009. Against the <pound>, it was set for a weekly gain of 1.5%, the most since January 2023. It was last up 0.4% at 84.03 pence. The <pound> was up 0.4% against the dollar at $1.292.
Code
stringr::str_detect(reuters, pat)
[1] TRUE
  • determine the position of the occurrences of “pound” or “sterling” (including uppercase “P” or “S”)
Code
stringr::str_locate_all(reuters, pat)
[[1]]
     start end
[1,]     5   9
[2,]   199 206
[3,]   371 375
[4,]   485 489
  • how many times does the article refer to “pound” or “sterling” (including uppercase “P” or “S”)?
Code
stringr::str_count(reuters, pat)
[1] 4
  • break this article in sentences:
Code
stringr::str_split(reuters, stringr::boundary("sentence"))
[[1]]
[1] "The pound headed for its worst weekly performance against the euro in over two years on Friday, as a boost to European spending drove a broad rally in the single currency, while against the dollar, sterling rose ahead of U.S. jobs data. "
[2] "The euro has surged across the board this week, logging its best weekly performance against the dollar since March 2009. "                                                                                                                    
[3] "Against the pound, it was set for a weekly gain of 1.5%, the most since January 2023. "                                                                                                                                                       
[4] "It was last up 0.4% at 84.03 pence. "                                                                                                                                                                                                         
[5] "The pound was up 0.4% against the dollar at $1.292."                                                                                                                                                                                          
  • using your previous code, extract the fourth sentence from the list:
Code
stringr::str_split(reuters, stringr::boundary("sentence"))[[1]][4]
[1] "It was last up 0.4% at 84.03 pence. "
  • use the pipe operator to: extract the fourth sentence from the text and split that sentence in words:
Code
stringr::str_split(reuters, stringr::boundary("sentence"))[[1]][4] |>
  stringr::str_split(stringr::boundary("word"))
[[1]]
[1] "It"    "was"   "last"  "up"    "0.4"   "at"    "84.03" "pence"

Create a character vector with “var_1”, …, “var_5” that you would use to add names to a vector. Save this vector in vec_names.

Code
vec_names <- paste("var", 1:5, sep = "_")
vec_names
[1] "var_1" "var_2" "var_3" "var_4" "var_5"

What would happen if you use collapse = "_" and not sep = "_"?

Code
vec_namesc <- paste("var", 1:5, collapse = "_")
vec_namesc
[1] "var 1_var 2_var 3_var 4_var 5"
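The difference is that sep is placed between the arguments of paste() (here “var” and each number), while collapse is placed between the elements of the resulting vector and turns them into one string. The short sketch below uses both; the object names are chosen for this illustration.

```r
# sep: separator between "var" and each number; result has 3 elements
names_sep <- paste("var", 1:3, sep = "_")
names_sep    # "var_1" "var_2" "var_3"

# sep and collapse together: one single string
names_both <- paste("var", 1:3, sep = "_", collapse = ", ")
names_both   # "var_1, var_2, var_3"

# paste0() is paste() with sep = ""
names_p0 <- paste0("var_", 1:3)
```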

4.1.10 Factors

4.1.10.1 Definition

Factors are a special type of vector and are used to represent categorical variables. Categorical variables can take a limited number of known values (often referred to as levels). Examples of categorical variables include nominal variables and ordinal variables. The first, nominal variables, have two or more categories but these have no intrinsic ordering. In other words, you cannot take one value of a nominal variable and say that it is higher, lower, bigger, smaller … than another value. Hair color, the name of a city, country or continent, a yes/no reply in a questionnaire or the name of a month are examples of nominal variables. You can order them alphabetically, or, for months, as they appear in a year, but the ordering doesn’t affect the way you handle them. In other words, if you would recode city names as numeric variables (1 = Amsterdam, 2 = Brussels, 3 = Copenhagen, …) these numeric values wouldn’t have any meaning. Ordinal variables differ from nominal variables as they have an intrinsic ordering. Examples include educational experience (elementary school, high school, some college, bachelor’s degree, master’s degree, PhD) or price categories measured as “budget” or “premium”. If you would recode these variables as numeric variables, their order would matter. For instance, you would recode “elementary school” as “1”, “high school” as “2”, “some college” as “3”, “bachelor’s degree” as “4”, “master’s degree” as “5” and “PhD” as “6” or, for price categories, “budget” as “1” and “premium” as “2”. However, these categories are not equally spaced. In other words, although the numeric difference between “high school” and “elementary school” (2 - 1 = 1) equals the numeric difference between “PhD” and “master’s degree” (6 - 5 = 1), the underlying difference in education is not necessarily the same.

4.1.10.2 Creating a factor

In addition to base R’s factor() function, {forcats}, a package included in the {tidyverse}, includes a lot of functions to manipulate factor variables. As we did with {stringr} and {lubridate} functions, I’ll include forcats:: at the start of a function if that function is part of that package. If forcats:: is not part of the function call, the function is a base R function. Recall that all {stringr} functions start with str_. In a similar way, all {forcats} functions start with fct_ and (most) are followed by a verb.

Suppose that you have a variable that records months:

vec_month1 <- c("Sep", "Aug", "Oct", "Jan", "Nov", "Mar", "Dec", "Apr", "Jun", "May", "Feb", "Jul" )

Recall from Chapter 2, that these months don’t sort in a meaningful way:

sort(vec_month1)
 [1] "Apr" "Aug" "Dec" "Feb" "Jan" "Jul" "Jun" "Mar" "May" "Nov" "Oct" "Sep"

To fix this, we can create a factor and include a vector of valid levels. These levels are ordered in a meaningful way. For instance:

vec_month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

We can now encode vec_month1 as a factor, including the levels, using factor(x = character(), levels, labels = levels, exclude = NA, ordered = is.ordered(x), nmax = NA). Here, x is the vector with the data you want to encode as a factor. The levels argument is optional: it is a vector with the unique values that x might take. R uses the order of this vector as the order of the levels. If levels is not included, R uses sort(unique(x)) to set the levels. The labels = levels argument allows you to add labels to the levels. These labels allow you to include a more descriptive term for every level. This is especially useful if the levels are recorded as numbers. By default, R sets these labels equal to the levels. You can exclude some values. In that case, you include a vector with the values to exclude after exclude =. By default, every unique non-missing value in x is treated as a separate level. If your data in x includes missing values, exclude = NULL will treat these missing values as a separate level. By default, that level is the last level. The ordered argument is FALSE by default, unless x is already an ordered factor. If it is set to TRUE, R will treat the factor as an ordinal variable. The last argument, nmax, sets an upper bound on the number of levels.

Let’s see what these options do. First, let’s accept all default values:

vec_fac1 <- factor(x = vec_month1)
vec_fac1
 [1] Sep Aug Oct Jan Nov Mar Dec Apr Jun May Feb Jul
Levels: Apr Aug Dec Feb Jan Jul Jun Mar May Nov Oct Sep

The output shows vec_month1 first and all levels next. As the command didn’t include levels, R ordered the levels using sort(unique()):

sort(unique(vec_month1))
 [1] "Apr" "Aug" "Dec" "Feb" "Jan" "Jul" "Jun" "Mar" "May" "Nov" "Oct" "Sep"

If you add levels, R will change the order and follow the order in the levels argument.

vec_fac1 <- factor(x = vec_month1, levels = vec_month_levels)
vec_fac1
 [1] Sep Aug Oct Jan Nov Mar Dec Apr Jun May Feb Jul
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

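Now, the levels are ordered as they were in vec_month_levels. Note that levels that do not occur in x are not dropped: they are kept as levels without observations. Values in x that do not occur in the levels vector, in contrast, are turned into NA.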
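To see why the order of the levels matters, the sketch below compares sorting the character vector with sorting the factor, and shows the integer codes R stores for the factor. The name vec_fac is chosen for this illustration.

```r
vec_month1 <- c("Sep", "Aug", "Oct", "Jan", "Nov", "Mar",
                "Dec", "Apr", "Jun", "May", "Feb", "Jul")
vec_month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                      "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

vec_fac <- factor(vec_month1, levels = vec_month_levels)

sort(vec_month1)   # character vector: alphabetical order
sort(vec_fac)      # factor: order of the levels, i.e. Jan, Feb, ..., Dec

# Each value is stored as the position of its level
codes <- as.integer(vec_fac)
codes              # 9 8 10 1 11 3 12 4 6 5 2 7
```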

Adding labels allows you to add more descriptive terms. For the months, these labels could be the months written in full:

vec_month_labels <- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")

Using these labels:

vec_fac1 <- factor(x = vec_month1, levels = vec_month_levels, labels = vec_month_labels)
vec_fac1
 [1] September August    October   January   November  March     December 
 [8] April     June      May       February  July     
12 Levels: January February March April May June July August ... December

All months are now written in full. Note that using these labels to set the levels is not possible. R tries to match every value in the vector it has to encode as a factor with a level in the levels vector. In other words, if R encounters “Jan” in the vector it has to encode, it searches for “Jan” in the levels vector. If that value is missing as a level, R will report NA in the encoded vector. R silently converts any value that it doesn’t find in the levels vector into NA.
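A small sketch of this behaviour: if the full month names are used as levels while the data hold abbreviations, no value finds a matching level. The vector x_abbr is made up for this example.

```r
# Hypothetical data with abbreviated months
x_abbr <- c("Jan", "Feb", "Jan")

# The levels are written in full, so no value in x_abbr matches a level
fac_wrong <- factor(x_abbr, levels = c("January", "February"))
fac_wrong          # <NA> <NA> <NA>, without any warning

all_missing <- all(is.na(fac_wrong))
all_missing        # TRUE
```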

{forcats}’ fct(x = character(), levels = NULL, na = character()) function allows you to create a factor, but avoids that values without a matching level are silently encoded as NA. The first argument is the vector to encode as a factor, the second the vector used for the levels (NULL or none by default) and the third, optional, argument allows you to include the values in x that fct() should treat as NA. Let’s first show how you can use this function:

vec_fac2 <- forcats::fct(x = c("Apr", "Feb", "Jan"), levels = c("Jan", "Feb", "Apr"))
vec_fac2
[1] Apr Feb Jan
Levels: Jan Feb Apr

Now let’s add a typo and write “Apr” as “Arp”.

vec_fac3 <- forcats::fct(x = c("Arp", "Feb", "Jan"), levels = c("Jan", "Feb", "Apr"))
Error in `forcats::fct()`:
! All values of `x` must appear in `levels` or `na`
ℹ Missing level: "Arp"
vec_fac3
Error: object 'vec_fac3' not found

Here, R produces an error: ! All values of x must appear in levels or na. Missing level: "Arp". As you can see from the error, it warns that a value in the x vector was not included in the levels vector. It also shows its value: “Arp”. Using base R’s factor() would add an NA without warning:

vec_fac3 <- factor(c("Arp", "Feb", "Jan"), levels = c("Jan", "Feb", "Apr"))
vec_fac3
[1] <NA> Feb  Jan 
Levels: Jan Feb Apr

Here, base R changes “Arp” into NA. If the typo is undetected, which is likely in large datasets, this would affect your analysis. In {forcats} you need to include “Arp” in the na = argument if you want to avoid an error. In other words, you have to instruct R to treat “Arp” as a missing value:

vec_fac3 <- forcats::fct(x = c("Arp", "Feb", "Jan"), levels = c("Jan", "Feb", "Apr"), na = c("Arp"))
vec_fac3
[1] <NA> Feb  Jan 
Levels: Jan Feb Apr

You can see from the output that “Arp” is now indeed recorded as a missing value. This is how you instructed R to treat “Arp”. There is a second difference between both functions. Base R’s factor() orders the levels using sort(unique(x)) in case the levels argument is missing. {forcats}’ fct() orders them by first appearance. In other words, it treats the order in which the values appear in the character vector as the order of the levels.
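The sketch below illustrates this second difference on a small made-up vector. It assumes a version of {forcats} that includes fct() (1.0.0 or later).

```r
# Hypothetical data: "Sep" appears first, then "Jan", then "Mar"
x_months <- c("Sep", "Jan", "Mar", "Jan")

lev_base    <- levels(factor(x_months))        # sorted alphabetically
lev_forcats <- levels(forcats::fct(x_months))  # order of first appearance

lev_base      # "Jan" "Mar" "Sep"
lev_forcats   # "Sep" "Jan" "Mar"
```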

Factors can include numeric values. Suppose you have a yes/no reply to a question where “No” is recorded as 0 and “Yes” as 1. You could encode that vector as a factor using:

vec_fac2 <- factor(x = c(1, 1, 1, 0, 0, 1, 0), levels = c(0, 1), labels = c("No", "Yes"))
vec_fac2
[1] Yes Yes Yes No  No  Yes No 
Levels: No Yes

Note that {forcats}’ fct() needs a character vector to encode as a factor. Including a numeric vector causes an error.

forcats::fct(x = c(1, 1, 1, 0, 0, 1, 0), levels = c(0, 1))
Error in `forcats::fct()`:
! `x` must be a character vector, not a double vector.

To create an ordered factor, you need to change ordered = is.ordered(x) to ordered = TRUE. Doing so creates an ordinal factor. Suppose you have income levels from a survey recorded as low = 1, medium = 2 and high = 3. Creating an ordered factor:

vec_ord1 <- factor(c(1, 2, 3, 3, 1, 1, 2, 2, 1), levels = c(1, 2, 3), labels = c("Low income", "Medium income", "High income"), ordered = TRUE)
vec_ord1
[1] Low income    Medium income High income   High income   Low income   
[6] Low income    Medium income Medium income Low income   
Levels: Low income < Medium income < High income

Here you see that the output shows the levels as well as their ordering: low income is lower than medium and medium income is lower than high income.
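Because the levels of an ordered factor have an order, you can compare values and ask for the minimum or maximum. The sketch below uses a shorter, made-up version of the income vector, stored in inc.

```r
# Hypothetical income data: 1 = low, 2 = medium, 3 = high
inc <- factor(c(1, 2, 3, 3, 1), levels = c(1, 2, 3),
              labels = c("Low income", "Medium income", "High income"),
              ordered = TRUE)

below_high <- inc < "High income"   # comparison uses the level order
below_high                          # TRUE TRUE FALSE FALSE TRUE

highest <- max(inc)                 # highest level present in the data
highest
```

For an unordered factor, these comparisons are not meaningful and R returns NA with a warning.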

You can check if a vector is a factor using is.factor() and if it is an ordered factor using is.ordered().

is.factor(vec_ord1)
[1] TRUE
is.ordered(vec_ord1)
[1] TRUE

You can coerce a vector into a factor or ordered factor using as.factor() or as.ordered(). Using c(1, 2, 1, 3, 2, 1) as an example

as.factor(c(1, 2, 1, 3, 2, 1))
[1] 1 2 1 3 2 1
Levels: 1 2 3
as.ordered(c(1, 2, 1, 3, 2, 1))
[1] 1 2 1 3 2 1
Levels: 1 < 2 < 3

you can see that both functions create a factor. The levels of the ordered factor are taken from sort(unique(x)). In other words, as.ordered() assumes that the sorted values reflect the intended order.
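For numbers this assumption is usually fine, but for character values the sorted order is alphabetical, which is often not the intended order. A sketch with made-up values:

```r
# Hypothetical ratings stored as text
ratings <- c("low", "high", "medium", "low")

# as.ordered() sorts alphabetically: high < low < medium
lev_auto <- levels(as.ordered(ratings))
lev_auto

# factor() with explicit levels gives the intended order
rat_ord <- factor(ratings, levels = c("low", "medium", "high"), ordered = TRUE)
lev_explicit <- levels(rat_ord)
lev_explicit
```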

4.1.10.3 Useful factor functions

In plots, it is often useful to reorder factor levels. For instance, if you would plot the population of a city where cities are encoded as factors and are alphabetically ordered, that plot would show these cities on the horizontal or vertical axis in that order. To produce a nice plot, it might be more convenient to have these cities ordered in terms of their population. In that way, the smallest city would show up on the left of the horizontal axis and the largest city on the right. To do so, you can use {forcats}’ fct_reorder() function. This function’s first argument is the factor to reorder. The second argument is the variable that R needs to use to reorder. The third argument, .fun = median, shows the summary function R uses to reorder. For each factor level, R calculates the value of the function in .fun and uses this value to reorder the levels. Using the default .na_rm = NULL, R removes missing values with a warning. Changing that into TRUE will cause R to remove them without a warning and FALSE preserves the NA’s. By default, R orders ascending (.desc = FALSE). Setting .desc = TRUE reverses this order.
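Here is a minimal sketch of fct_reorder(). The cities and population figures (in thousands) are made up for this illustration and are not real data.

```r
# Hypothetical cities and populations (in thousands), for illustration only
city <- factor(c("Antwerp", "Brussels", "Ghent"))
pop  <- c(530, 1200, 265)

levels(city)   # alphabetical: "Antwerp" "Brussels" "Ghent"

# Reorder the levels by population, smallest first (default)
city_asc <- forcats::fct_reorder(city, pop)
levels(city_asc)    # "Ghent" "Antwerp" "Brussels"

# Largest first
city_desc <- forcats::fct_reorder(city, pop, .desc = TRUE)
levels(city_desc)   # "Brussels" "Antwerp" "Ghent"
```

Only the order of the levels changes. The values in the factor stay in their original positions.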

{forcats}’ fct_recode() and fct_collapse() allow you to modify the factor levels. fct_recode() allows you to recode factor levels. To do so, you need to use fct_recode(x, "new value" = "old value") where x is the factor and the statement "new value" = "old value" is entered for every old factor level that you need to change. If an “old value” is not included, R assumes that it remains as is. To illustrate, we first create a factor:

vec_fac1 <- factor(x = c(1, 2, 3, 4), levels = c(1, 2, 3, 4), labels = c("small city", "large city", "small town", "large town"))
vec_fac1
[1] small city large city small town large town
Levels: small city large city small town large town

Let’s now recode to show levels “city, small”, “city, large”, “town, small” and “town, large”:

forcats::fct_recode(vec_fac1, 
                    "city, small" = "small city",
                    "city, large" = "large city", 
                    "town, small" = "small town", 
                    "town, large" = "large town")
[1] city, small city, large town, small town, large
Levels: city, small city, large town, small town, large

Note that you can use this function to reduce the number of levels. For instance, if you want to drop the difference between “small” and “large” and only keep “city” and “town”, you can recode all “small city” and “large city” to “city”:

forcats::fct_recode(vec_fac1, 
                    "city" = "small city",
                    "city" = "large city", 
                    "town" = "small town", 
                    "town" = "large town")
[1] city city town town
Levels: city town

Note that you will lose the 4 original factor levels if you recode and assign the result to the same factor.

{forcats}’ fct_collapse() function performs a similar task. It allows you to collapse various factor levels. The function’s arguments are similar to those of fct_recode(). For instance, suppose you want to merge “small” and “large” into one level. With fct_collapse(), you would use:

forcats::fct_collapse(vec_fac1, 
                    "city" = c("small city", "large city"), 
                    "town" = c("small town", "large town"))
[1] city city town town
Levels: city town

Here, you use "city" = c("small city", "large city") to collapse the levels on the right hand side of the equality sign into the level on the left hand side.

Suppose that you have a variable with the values “Asi”, “Afr”, “Eur”, “Ame”, “Oce”. These values stand for “Asia”, “Africa”, “Europe”, “Americas” and “Oceania”. Create a factor that will show these continents in alphabetical order and add labels. Use cont to store this variable.

Code
cont <- factor(c("Asi", "Afr", "Eur", "Ame", "Oce"), levels= c("Afr", "Ame", "Asi", "Eur", "Oce"), labels = c("Africa", "Americas", "Asia", "Europe", "Oceania"))
cont
[1] Asia     Africa   Europe   Americas Oceania 
Levels: Africa Americas Asia Europe Oceania

Is cont a factor?

Code
is.factor(cont)
[1] TRUE

Is cont an ordered factor?

Code
is.ordered(cont)
[1] FALSE

To measure an individual’s education, the following values are used in your dataset: “some high school”, “high school”, “some college”, “bachelor”, “master”, “PhD”. These values are encoded using numbers: 1 (some high school), 2 (high school), … 6 (PhD). var_school shows such a variable.

var_school <- sample(1:6, 20, replace = TRUE)

Create an ordered factor including labels. Assign this factor to school.

Code
school <- factor(var_school, levels= c(1, 2, 3, 4, 5, 6), labels = c("some high school", "high school", "some college", "bachelor", "master", "PhD"), ordered = TRUE)
school
 [1] some college     bachelor         bachelor         PhD             
 [5] some high school bachelor         master           master          
 [9] master           PhD              some high school master          
[13] PhD              high school      master           some college    
[17] master           some high school PhD              master          
6 Levels: some high school < high school < some college < ... < PhD

Use {forcats} to recode school and reduce the number of levels by merging “high school” and “some college” into “secondary” and merging “bachelor” and “master” into “tertiary”. There are two ways to do this:

  • Option 1:
Code
forcats::fct_recode(school, 
                    "secondary" = "high school", 
                    "secondary" = "some college", 
                    "tertiary" = "bachelor", 
                    "tertiary" = "master")
 [1] secondary        tertiary         tertiary         PhD             
 [5] some high school tertiary         tertiary         tertiary        
 [9] tertiary         PhD              some high school tertiary        
[13] PhD              secondary        tertiary         secondary       
[17] tertiary         some high school PhD              tertiary        
Levels: some high school < secondary < tertiary < PhD
  • Option 2:
Code
forcats::fct_collapse(school, 
                    "secondary" = c("high school", "some college"),
                    "tertiary" = c("bachelor", "master")) 
 [1] secondary        tertiary         tertiary         PhD             
 [5] some high school tertiary         tertiary         tertiary        
 [9] tertiary         PhD              some high school tertiary        
[13] PhD              secondary        tertiary         secondary       
[17] tertiary         some high school PhD              tertiary        
Levels: some high school < secondary < tertiary < PhD

4.2 Matrices

Matrices are two-dimensional objects that allow you to store data in rows and columns. Recall that vectors store data in one row and one or more columns. Like vectors, matrices are homogeneous. In other words, they store numeric or character or boolean or date/time values but not a combination of two or more data types. Most of what we discussed for vectors also applies to matrices. As a matter of fact, you can think of a vector as a special case of a matrix: it is a matrix with one row and one or more columns. However, if you want to use the vector as a matrix, you need to create a matrix with 1 row and n columns.

An “mxn” matrix has m rows and n columns. In general, the value on the ith row and jth column is referred to as matrix-name(i,j). We’ll see in the next section how you subset a matrix. A matrix with the same number of rows as columns, i.e. an nxn matrix, is also called a square matrix. Here we will focus on numeric matrices. However, as long as all elements in a matrix are of the same type, a matrix can also hold characters, logical values, integers or date/time values. As you will see here, most of what we learned for vectors also applies to matrices.

4.2.1 Creating a matrix

We will first show how to create a matrix in general. We then move to a couple of special matrices.

4.2.1.1 The basics

To create a matrix, you use the matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL) function. The first argument is optional and allows you to add a vector with data to fill the matrix. The second and third arguments, nrow = 1 and ncol = 1, determine the size of the matrix: the desired number of rows and columns. If you add a vector that R needs to use to fill the matrix, byrow = FALSE instructs R to fill the matrix by column. In other words, if you have 4 rows and 5 columns, R first fills all rows of the first column, then all rows of the second, … to end with all 4 rows of the 5th column. Changing this default into TRUE tells R to fill the matrix by row. In other words, R will now first fill the 5 columns of the first row, then move to the second row and fill all columns in that row, … . The last argument allows you to add names to the rows and columns. Let’s create a 2x3 matrix mat_0:

mat_0 <- matrix(data = c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
mat_0
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

As you can see, R creates a matrix with two rows [1, ] and [2, ] and three columns [,1], [,2], [,3]. R added the values in c() by column (default: byrow = FALSE). It took the first two values in c() (1 and 2) and used these to fill the first column. The next two values, 3 and 4, were added to the second column. The last two values in c() are shown in the last column.

The attributes of mat_0 include the dimensions of the matrix: the number of rows and the number of columns:

attributes(mat_0)
$dim
[1] 2 3

To create mat_0 we included its elements via c(). The argument data can also be an existing vector or the result of a function call. For instance, let’s use a vector vec_1 with 6 elements to illustrate this. We’ll fill vec_1 with a sequence of 1 to 6 using the shorthand for seq(from = 1, to = 6, by = 1):

vec_1 <- 1:6

We can now create a matrix mat_1 (accepting the default values for byrow and dimnames):

mat_1 <- matrix(vec_1, nrow = 2, ncol = 3)
mat_1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

As you can see, R filled this matrix in the same way as R did for mat_0. Note that we didn’t have to create a separate vector vec_1. We could have included 1:6 as the first argument.

In both examples, the length of c() or vec_1 was equal to the number of cells in the matrix: the number of cells equals nrow * ncol = 6 and the length of vec_1 or c() was also 6. If that is not the case, R reports a warning when the vector does not fit evenly into the matrix, as in the examples below. If the number of cells in the matrix is larger than the length of the vector, R will use some or all values more than once. Suppose that you have a vector with 4 elements that needs to fill a matrix with 2 rows and 3 columns:

matrix(1:4, nrow = 2, ncol = 3)
Warning in matrix(1:4, nrow = 2, ncol = 3): data length [4] is not a
sub-multiple or multiple of the number of columns [3]
     [,1] [,2] [,3]
[1,]    1    3    1
[2,]    2    4    2

R will use the first two elements of the vector two times. After having used all 4 elements of 1:4 to fill the first two columns, R uses the same vector again to fill the remaining cells. In this example, R used 1 and 2 of the sequence 1:4 twice. Note that R shows a warning that the dimensions of the vector and matrix didn’t fit.

If the vector has more elements than the matrix has cells,

matrix(1:9, 2, 3)
Warning in matrix(1:9, 2, 3): data length [9] is not a sub-multiple or multiple
of the number of rows [2]
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

R uses only the first nrow * ncol elements of the vector. Here R used the first 6 elements of the vector 1:9 to fill the matrix, and dropped the others. Again, R shows a warning message that the dimensions didn’t fit.
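
The recycling and truncation described above can also be made explicit, which avoids the warning. The following is a minimal sketch using base R’s rep_len() and head(); the object names (values, n_cells, recycled, truncated) are illustrative and not part of the original examples.

```r
# Illustrative names; base R only.
values  <- 1:4
n_cells <- 2 * 3   # number of cells in a 2x3 matrix

# Recycle explicitly: rep_len() repeats the vector up to the requested length
recycled <- matrix(rep_len(values, n_cells), nrow = 2, ncol = 3)

# Truncate explicitly: head() keeps only the first n_cells elements
truncated <- matrix(head(1:9, n_cells), nrow = 2, ncol = 3)

recycled
truncated
```

Because the data now has exactly nrow * ncol elements, matrix() has nothing to warn about, and the intent (recycle or truncate) is visible in the code.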

You can also create a matrix by assigning dimensions to a vector with the dim() function. The next example shows how this works. Note that after the assignment, vec_1 itself is a matrix:

vec_1 <- 1:6
dim(vec_1) <- c(2, 3)
vec_1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
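
To see that dim() only adds or removes an attribute and leaves the data untouched, consider this small sketch. The object x is an illustrative name.

```r
x <- 1:6
is.matrix(x)        # FALSE: a plain vector has no dim attribute

dim(x) <- c(2, 3)   # same data, now arranged in 2 rows and 3 columns
is.matrix(x)        # TRUE

dim(x) <- NULL      # removing the attribute gives the plain vector back
is.matrix(x)        # FALSE
x
```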

You can use all other functions that generate a vector to fill a matrix. Examples include:

  • a matrix with random numbers:
matrix(rnorm(6), nrow = 2, ncol = 3)
           [,1]      [,2]       [,3]
[1,] 0.02905913  1.288859 -0.1529046
[2,] 0.46637676 -0.674857  0.2705640
  • using letters:
matrix(letters[1:6], nrow = 2, ncol = 3)
     [,1] [,2] [,3]
[1,] "a"  "c"  "e" 
[2,] "b"  "d"  "f" 
  • as a sample:
matrix(sample(1:1000, 6), nrow = 2, ncol = 3)
     [,1] [,2] [,3]
[1,]  817  426  412
[2,]  573  530  550
  • using a set operator on two vectors generated as 1:12 and 7:18:
matrix(base::intersect(1:12, 7:18), nrow = 2, ncol = 3)
     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12

Let’s now see how byrow = TRUE changes the outcome of matrix():

mat_1 <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
mat_1
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

Recall that with the default for byrow, R first filled the first column, then the second and then the third. As you can see from the output, R now filled the first row first, putting the first value of 1:6 in the first column, the second value in the second and the third value in the third column. As all columns in the first row were filled, R moved to the second row and used the fourth value to fill the first column on the second row, the fifth value to fill the second column on the second row and the sixth value to fill the third column on the second row.
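
The difference between the two fill orders can be checked element by element. The sketch below also uses t(), base R’s transpose function, which has not been introduced in the text above; it is included only to show that filling a 2x3 matrix by row gives the same result as filling a 3x2 matrix by column and transposing it. All object names are illustrative.

```r
vals <- 1:6

by_col <- matrix(vals, nrow = 2, ncol = 3)                 # default: fill by column
by_row <- matrix(vals, nrow = 2, ncol = 3, byrow = TRUE)   # fill by row

by_col[2, 1]   # second value of vals
by_row[2, 1]   # fourth value of vals

# Filling by row equals filling the transposed shape by column, then transposing
same <- identical(by_row, t(matrix(vals, nrow = 3, ncol = 2)))
same
```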

The last option allows you to specify row and column names. To do so, you need to collect them in a list. We’ll see shortly that lists are yet another data structure in R. Note that there are other ways to set column and row names.

mat_1 <- matrix(vec_1, nrow = 2, ncol = 3, byrow = TRUE, dimnames = list(c("row1", "row2"), c("var1", "var2", "var3")))
mat_1
     var1 var2 var3
row1    1    2    3
row2    4    5    6

The output now shows the row and column names. These names are added to the attributes of the matrix as dimnames[[1]] for the rows and dimnames[[2]] for the columns:

attributes(mat_1)
$dim
[1] 2 3

$dimnames
$dimnames[[1]]
[1] "row1" "row2"

$dimnames[[2]]
[1] "var1" "var2" "var3"

An alternative way to add row and column names is to use the functions rownames() and colnames(). Let’s first recreate mat_1 without names:

mat_1 <- matrix(1:6, nrow = 2, ncol = 3)

You can use colnames() in two ways. The argument of this function is a vector with column names. Suppose you want to add the following names to the columns of mat_1: c("var1", "var2", "var3"). The first way to do so is to use

colnames(mat_1) <- c("var1", "var2", "var3")
mat_1
     var1 var2 var3
[1,]    1    3    5
[2,]    2    4    6

To add rownames, you can use rownames(). This function requires a vector with the names: c("row1", "row2"). To add these names:

rownames(mat_1) <- c("row1", "row2")
mat_1
     var1 var2 var3
row1    1    3    5
row2    2    4    6

If you only want names that consist of a prefix and a row or column number, e.g. “col1” or “row1”, then there is a shortcut where you don’t have to type all row or column names. Using colnames(x, do.NULL = TRUE, prefix = "col") you can specify the matrix in x and the prefix in prefix = "col". The argument do.NULL is TRUE by default: if the matrix has no column names, the function returns NULL. Changing that into FALSE tells R to create names from the prefix and the column number. Existing names are not overwritten: because mat_1 already has column names, the call below returns them unchanged and the prefix "var_" is not used.

colnames(mat_1) <- colnames(mat_1, do.NULL = FALSE, prefix = "var_")
mat_1
     var1 var2 var3
row1    1    3    5
row2    2    4    6

You can do the same for the rows, e.g. with the prefix “obs_” and the row number. Again, mat_1 already has row names, so they are kept:

rownames(mat_1) <- rownames(mat_1, do.NULL = FALSE, prefix = "obs_")
mat_1
     var1 var2 var3
row1    1    3    5
row2    2    4    6
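
The prefix only comes into play when a matrix has no names yet. The following sketch shows this on a freshly created, unnamed matrix; the object m is an illustrative name.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
colnames(m)   # NULL: no names yet

# With do.NULL = FALSE, names are generated from the prefix and the index
colnames(m) <- colnames(m, do.NULL = FALSE, prefix = "var_")
rownames(m) <- rownames(m, do.NULL = FALSE, prefix = "obs_")
m
```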

Note that there are many ways you can use the character functions to automate the process of naming rows and columns. As an illustration, let’s rewrite

rownames(mat_1) <- c("row1", "row2")
mat_1
     var1 var2 var3
row1    1    3    5
row2    2    4    6

using the paste0() function:

rownames(mat_1) <- paste0("row", 1:2)
mat_1
     var1 var2 var3
row1    1    3    5
row2    2    4    6

and

colnames(mat_1) <- c("var1", "var2", "var3")
mat_1
     var1 var2 var3
row1    1    3    5
row2    2    4    6

using the paste() function:

colnames(mat_1) <- paste(1:3, c("st", "nd", "rd"), sep="")
mat_1
     1st 2nd 3rd
row1   1   3   5
row2   2   4   6

If your matrix has column or row names, you can show these using the same colnames() or rownames() function. For instance:

colnames(mat_1)
[1] "1st" "2nd" "3rd"
rownames(mat_1)
[1] "row1" "row2"

4.2.1.2 Special matrices

Recall that vectors are data structures with 1 row and one or more columns. If you need to work with matrix algebra and use vectors, it is best to create a vector explicitly as a matrix. To do so, you need a matrix with 1 row and e.g. 3 columns:

mat_vec <- matrix(1:3, 1, 3)

There are a couple of special matrices. Using the matrix function, we can create an mxn matrix filled with one constant value.

mat_2 <- matrix(5, nrow = 2, ncol = 3)
mat_2
     [,1] [,2] [,3]
[1,]    5    5    5
[2,]    5    5    5

Here there are two special cases: a square matrix filled with ones;

J <- matrix(1, nrow = 3, ncol = 3)
J
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1
[3,]    1    1    1

and the zero matrix: an mxn matrix filled with zeros:

zeros <- matrix(0, nrow = 2, ncol = 3)
zeros
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0

A diagonal matrix is a square matrix where all values are equal to zero except those on the diagonal:

diag(c(10, 11, 12), nrow = 3, ncol = 3)
     [,1] [,2] [,3]
[1,]   10    0    0
[2,]    0   11    0
[3,]    0    0   12

A special case of this diagonal matrix is the identity matrix: a diagonal matrix whose diagonal elements are equal to 1:

ident <- diag(1, nrow = 3, ncol = 3)
ident
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
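
The diag() function behaves differently depending on what you pass to it. The short sketch below shows three base R uses; the object names are illustrative.

```r
ident3 <- diag(3)             # a single number n gives the n x n identity matrix
d      <- diag(c(10, 11, 12)) # a vector becomes the diagonal of a square matrix
diag_d <- diag(d)             # a matrix returns its diagonal as a vector

ident3
d
diag_d
```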

The last special case is a vector (a 1xn matrix) whose elements are all 1:

vec_ones <- matrix(1, 1, 3)
vec_ones
     [,1] [,2] [,3]
[1,]    1    1    1

Triangular matrices are square matrices where all elements below the diagonal are 0 (upper triangular) or all elements above the diagonal are 0 (lower triangular). If, in addition to these elements, the elements on the diagonal are also 0, the square matrix is strictly triangular. The functions upper.tri(x, diag = FALSE) and lower.tri(x, diag = FALSE) can be used to create those matrices. These functions return a logical matrix whose elements are TRUE if they are above the diagonal (upper, with diag = FALSE) and FALSE if this is not the case. With diag = TRUE, the logical values on the diagonal will also be TRUE. The interpretation for the lower triangular function is identical, with the exception that TRUE in this case marks the elements below, or below and on, the diagonal. To see how these functions work, we’ll use:

mat_1 <- matrix(1:25, 5, 5)

Using upper.tri() as an example to show the logical matrix:

upper.tri(mat_1, diag = FALSE)
      [,1]  [,2]  [,3]  [,4]  [,5]
[1,] FALSE  TRUE  TRUE  TRUE  TRUE
[2,] FALSE FALSE  TRUE  TRUE  TRUE
[3,] FALSE FALSE FALSE  TRUE  TRUE
[4,] FALSE FALSE FALSE FALSE  TRUE
[5,] FALSE FALSE FALSE FALSE FALSE

We can now use this logical matrix to change mat_1 into a lower triangular matrix whose elements on the diagonal differ from 0:

mat_1[upper.tri(mat_1, diag = FALSE)] <- 0
mat_1
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    2    7    0    0    0
[3,]    3    8   13    0    0
[4,]    4    9   14   19    0
[5,]    5   10   15   20   25

Note that here, we use the function upper.tri() to create a lower triangular matrix: it marks the elements above the diagonal, which we then set to 0. To create a strictly lower triangular matrix, change the default value diag = FALSE into diag = TRUE so that the diagonal is set to 0 as well:

mat_1 <- matrix(1:25, 5, 5)
mat_1[upper.tri(mat_1, diag = TRUE)] <- 0
mat_1
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0    0
[2,]    2    0    0    0    0
[3,]    3    8    0    0    0
[4,]    4    9   14    0    0
[5,]    5   10   15   20    0
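
The mirror image works the same way: lower.tri() marks the elements below the diagonal, so setting them to 0 gives an upper triangular matrix. A minimal sketch with illustrative object names:

```r
m <- matrix(1:9, 3, 3)

upper <- m
upper[lower.tri(upper)] <- 0   # zero out everything below the diagonal

upper
```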

4.2.1.3 Coercing objects to matrix class

For R, a vector is not a matrix. Because vec_1 was given dimensions with dim() earlier, we first recreate it as a plain vector. You can then see the difference if you ask R what class vec_1 is and compare that result with the class of mat_1:

vec_1 <- 1:6
class(vec_1)
[1] "integer"
class(mat_1)
[1] "matrix" "array" 

The as.matrix(x) function tries to turn the object x into a matrix. If x already has dimensions, as.matrix() keeps them. A plain vector has no dimensions, so as.matrix() changes the vector vec_1 into a matrix with 6 rows and 1 column.

mat_1 <- as.matrix(vec_1)
mat_1
     [,1]
[1,]    1
[2,]    2
[3,]    3
[4,]    4
[5,]    5
[6,]    6
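
When a plain vector (one without a dim attribute) is passed to as.matrix(), the result is a single column. If you need a single row instead, matrix() with nrow = 1 is one option. The sketch below contrasts the two; the object names are illustrative.

```r
v <- 1:3   # plain vector, no dim attribute

col_mat <- as.matrix(v)          # 3 rows, 1 column
row_mat <- matrix(v, nrow = 1)   # 1 row, 3 columns

dim(col_mat)
dim(row_mat)
```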

You can change a data frame into a matrix in a similar way. Recall from Chapter 1 that R includes a dataset mtcars. A data frame is a data structure we will discuss shortly:

class(mtcars)
[1] "data.frame"

Recall that this data frame had 32 observations for 11 variables. This data frame includes variable names and identifies every observation:

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

With as.matrix() you can change this data frame into a matrix:

mat_mtcars <- as.matrix(mtcars)

This matrix has column and row names.

colnames(mat_mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"
rownames(mat_mtcars)
 [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
 [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
 [7] "Duster 360"          "Merc 240D"           "Merc 230"           
[10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
[13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
[16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
[19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
[22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
[25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
[28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
[31] "Maserati Bora"       "Volvo 142E"         

Recall that a matrix is homogeneous: all elements must be of the same type. Often data frames are heterogeneous: they include numeric, character, date/time, boolean or factor variables. Using the data.matrix() function, R will change such a data frame into a numeric matrix by converting all variables to numeric first. For instance, suppose that you have a data frame df:

df <- data.frame(A = 1:3, B = letters[1:3], C = seq.Date(as.Date("2025-03-25"), by = "day", length.out = 3))
df
  A B          C
1 1 a 2025-03-25
2 2 b 2025-03-26
3 3 c 2025-03-27

If you use as.matrix(), R converts this data frame into a character matrix:

mat_df <- as.matrix(df)
typeof(mat_df)
[1] "character"

Using data.matrix() avoids this:

mat_df <- data.matrix(df)
mat_df
     A B     C
[1,] 1 1 20172
[2,] 2 2 20173
[3,] 3 3 20174

As you can see, the second column, B, has been changed to numeric. R changed the value of “a” into 1, “b” into 2, … . Here, all unique values are given a different numeric value. In addition, R used the fact that dates are stored as numbers (the number of days since 1970-01-01) to convert the dates.
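
The numbers that data.matrix() assigns to a character column follow the sorted unique values, not the order of appearance, because the column is first treated as a factor. The sketch below illustrates this on a small, made-up data frame (df_demo is a hypothetical example, not the df used above).

```r
# Hypothetical data: "b" appears before "a"
df_demo <- data.frame(B = c("b", "a", "b"))

codes  <- data.matrix(df_demo)[, "B"]      # codes assigned by data.matrix()
manual <- as.integer(factor(df_demo$B))    # the same codes, built by hand

codes
manual
```

Here “a” gets code 1 and “b” gets code 2, even though “b” is the first value in the column.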

4.2.2 Matrix type and attributes

4.2.2.1 Matrix type

So far, we created every matrix ourselves using matrix(). However, a matrix can also be the result of other operations in your code. To check if an object is a matrix, you can use the is.matrix() function.

is.matrix(mat_1)
[1] TRUE

If an object is a matrix, this function returns TRUE. If this is not the case, the function returns FALSE. If an object is a matrix, you can check its type using typeof():

typeof(mat_1)
[1] "integer"

Here, mat_1 is an integer matrix. In other words, its values are of type “integer”. Recall that this matrix was created from the sequence 1:6, which is an integer sequence. The type of a matrix is determined in a similar way as that of a vector. To illustrate, let’s define 3 matrices

mat_n <- matrix(rnorm(6), 2, 3)
mat_c <- matrix(letters[1:6], 2, 3)
mat_d <- matrix(seq(as.Date("2025-03-25"), by = "day", length.out = 6), 2, 3)

and check their type:

typeof(mat_n)
[1] "double"
typeof(mat_c)
[1] "character"
typeof(mat_d)
[1] "double"

As you can see, R stores the dates in mat_d as numeric values; the class of mat_d is “matrix” “array”. Recall that dates are stored as numbers. In other words, you can turn these numbers back into dates using one of the functions we have covered, e.g. as.Date() with origin = "1970-01-01" or {lubridate}’s as_date().
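
A minimal sketch of the round trip from dates to a numeric matrix and back, using base R only. The object names are illustrative, and the origin argument is passed explicitly because older R versions require it for numeric input.

```r
dates     <- seq(as.Date("2025-03-25"), by = "day", length.out = 6)
mat_dates <- matrix(dates, 2, 3)   # the Date class is dropped, numbers remain

typeof(mat_dates)

# Convert a single cell back into a date (days since 1970-01-01)
back <- as.Date(mat_dates[2, 3], origin = "1970-01-01")
format(back)
```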

It is important to stress that matrices are homogeneous in the sense that they can only include values of one type. For instance, the following line changes the value of the element on the second row and second column in mat_n into a character value.

mat_n[2,2] <- "10000"
mat_n
     [,1]                [,2]               [,3]               
[1,] "0.982499218575708" "0.42216077988595" "0.22435607224835" 
[2,] "-1.01112994823057" "10000"            "0.125027643014595"

The output suggests that R coerced all other elements into character values. Indeed, mat_n is now a character matrix:

typeof(mat_n)
[1] "character"

We have seen similar behavior with vectors.
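
If the new value is a number that happens to be stored as text, converting it before the assignment keeps the matrix numeric. The following sketch uses illustrative objects m and m_chr to contrast both cases.

```r
m <- matrix(c(1.5, 2.5, 3.5, 4.5), 2, 2)

# Convert the text to a number first: the matrix stays numeric
m[2, 2] <- as.numeric("10000")
typeof(m)

# Assigning text directly coerces every element to character
m_chr <- m
m_chr[1, 1] <- "x"
typeof(m_chr)
```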

4.2.2.2 Matrix attributes

A matrix has some attributes that you will often use in code. To check the attributes of an object, you can use R’s attributes() function. To illustrate this function, we’ll use

mat_1 <- matrix(vec_1, nrow = 2, ncol = 3, byrow = TRUE, dimnames = list(c("row1", "row2"), c("var1", "var2", "var3")))
mat_1
     var1 var2 var3
row1    1    2    3
row2    4    5    6

The function shows the attributes of the object:

att_mat1 <- attributes(mat_1)
att_mat1
$dim
[1] 2 3

$dimnames
$dimnames[[1]]
[1] "row1" "row2"

$dimnames[[2]]
[1] "var1" "var2" "var3"

Here, you see the various attributes: the dimension $dim, the row names $dimnames[[1]] and the column names $dimnames[[2]]. There are multiple ways to access these attributes. For instance, the dimension of the matrix includes the number of rows (2) and the number of columns (3). To extract these dimensions you can use dim(). This function shows the number of rows and columns in mat_1.

dim(mat_1)
[1] 2 3

You can use these to extract the number of rows and columns. To do so, you assign the result of this function to an object.

dim_mat1 <- dim(mat_1)
dim_mat1
[1] 2 3

You can now subset this result:

nobs <- dim_mat1[1]
nvar <- dim_mat1[2]

These values now store the number of rows (nobs) and the number of columns (nvar).

If you are only interested in the number of rows or number of columns, you can extract these using nrow() or ncol():

nrow(mat_1)
[1] 2
ncol(mat_1)
[1] 3

Note that in many cases, the number of rows will be equal to the number of observations in your dataset while the number of columns is equal to the number of variables.

To see the total number of values in the matrix, or the product of the number of rows and the number of columns, you can use length():

length(mat_1) 
[1] 6
nrow(mat_1) * ncol(mat_1)
[1] 6

If you need the column or row names, you can use colnames() or rownames():

colnames(mat_1)
[1] "var1" "var2" "var3"
rownames(mat_1)
[1] "row1" "row2"

If you store these names, you can use them in your code.

Create a 3x3 matrix mat_0 using the sequence 21:29.

Code
mat_0 <- matrix(21:29, 3, 3)
mat_0
     [,1] [,2] [,3]
[1,]   21   24   27
[2,]   22   25   28
[3,]   23   26   29

Using the same values, fill this matrix by row:

Code
mat_0 <- matrix(21:29, 3, 3, byrow = TRUE)
mat_0
     [,1] [,2] [,3]
[1,]   21   22   23
[2,]   24   25   26
[3,]   27   28   29

Store the numbers 21-29 in a vector mat_0 and create a matrix using the dim() function

Code
mat_0 <- 21:29
dim(mat_0) <- c(3, 3)
mat_0
     [,1] [,2] [,3]
[1,]   21   24   27
[2,]   22   25   28
[3,]   23   26   29

What happens if you use 1:4 to create a 2x3 matrix mat_0, filled by column? Predict the value in mat_0[2, 3].

Code
mat_0 <- matrix(1:4, 2, 3)
Warning in matrix(1:4, 2, 3): data length [4] is not a sub-multiple or multiple
of the number of columns [3]
Code
mat_0[2, 3]
[1] 2

What is the value in mat_0[2, 3] if you fill the matrix with 1:9?

Code
mat_0 <- matrix(1:9, 2, 3)
Warning in matrix(1:9, 2, 3): data length [9] is not a sub-multiple or multiple
of the number of rows [2]
Code
mat_0[2 ,3]
[1] 6

Create a named 3x3 matrix mat_0 with elements 21-29, row names “obs_1”, “obs_2”, … and column names “var_1”, “var_2”, …

Code
mat_0 <- matrix(21:29, 3, 3, dimnames = list(c("obs_1", "obs_2", "obs_3"), c("var_1",  "var_2", "var_3")))
mat_0
      var_1 var_2 var_3
obs_1    21    24    27
obs_2    22    25    28
obs_3    23    26    29

Here, you had to write down all names. First recreate mat_0 without names and then use the rownames() and colnames() functions to set the names. Using these functions, try to avoid writing all names. There are two ways to do so.

  • Option 1:
Code
mat_0 <- matrix(21:29, 3, 3)
rownames(mat_0) <- paste("obs", 1:3, sep = "_")
colnames(mat_0) <- paste("var", 1:3, sep = "_")
mat_0
      var_1 var_2 var_3
obs_1    21    24    27
obs_2    22    25    28
obs_3    23    26    29
  • Option 2:
Code
mat_0 <- matrix(21:29, 3, 3)
rownames(mat_0) <- rownames(mat_0, do.NULL = FALSE, prefix = "obs_")
colnames(mat_0) <- colnames(mat_0, do.NULL = FALSE, prefix = "var_")
mat_0
      var_1 var_2 var_3
obs_1    21    24    27
obs_2    22    25    28
obs_3    23    26    29

Create a 4x4 identity matrix ident

Code
ident <- diag(1, 4, 4)
ident
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    1    0    0
[3,]    0    0    1    0
[4,]    0    0    0    1

Determine the number of rows and columns for this matrix:

mat_0 <- matrix(rnorm(1000), 500, 2)
colnames(mat_0) <- c("var_1", "var_2")
rownames(mat_0) <- paste("obs", 1:500, sep = "_")
  • Option 1:
Code
nrow(mat_0)
[1] 500
Code
ncol(mat_0)
[1] 2
  • Option 2:
Code
attributes(mat_0)$dim[1]
[1] 500
Code
attributes(mat_0)$dim[2]
[1] 2

Determine the type of mat_0:

Code
typeof(mat_0)
[1] "double"

Fill a 3x3 matrix, mat_1, with the first 9 lowercase letters of the alphabet.

Code
mat_1 <- matrix(letters[1:9], 3, 3)
mat_1
     [,1] [,2] [,3]
[1,] "a"  "d"  "g" 
[2,] "b"  "e"  "h" 
[3,] "c"  "f"  "i" 

Fill a 3x3 matrix, mat_2 with a sequence of dates, starting 2025-04-01 and ending 2025-04-09.

Code
mat_2 <- matrix(seq.Date(from = as.Date("2025-04-01"), to = as.Date("2025-04-09"), by = "days"), 3, 3)
mat_2
      [,1]  [,2]  [,3]
[1,] 20179 20182 20185
[2,] 20180 20183 20186
[3,] 20181 20184 20187

Create a 3x3 boolean matrix, mat_3, using a random sample from TRUE and FALSE.

Code
mat_3 <- matrix(sample(c(TRUE, FALSE), 9, replace = TRUE), 3, 3)
mat_3
      [,1]  [,2]  [,3]
[1,] FALSE  TRUE FALSE
[2,]  TRUE FALSE FALSE
[3,] FALSE  TRUE FALSE

4.2.3 Subsetting a matrix

4.2.3.1 Subsetting by position

Subsetting a matrix uses an approach which is very similar to the one used for a vector. However, with a matrix you have both rows and columns. This allows you to subset individual elements, all rows of one or multiple columns, all columns of one or multiple rows, or a range of elements spread over some columns and some rows. Matrix mat will be used to illustrate these approaches:

mat <- matrix(c(11, 21, 31, 41, 12, 22, 32, 42, 13, 23, 33, 34, 41, 42, 43, 44), nrow = 4, ncol = 4)
mat
     [,1] [,2] [,3] [,4]
[1,]   11   12   13   41
[2,]   21   22   23   42
[3,]   31   32   33   43
[4,]   41   42   34   44

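As you can see, most elements encode their position: 23, for instance, sits on row 2 and column 3. This does not hold everywhere: the fourth column holds the values 41 to 44 and the element on row 4 and column 3 is 34.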

To subset an individual element, you can use mat[m, n] with m the row index and n the column index. For instance, extracting the element in the second row and the third column:

mat[2, 3]
[1] 23

If you assign the outcome to a new variable, you can use it in your code.

You can extract an entire column using mat[, n]. For instance, extracting the 4th column of mat:

mat[, 4]
[1] 41 42 43 44

Subsetting a specific row uses a similar approach. To subset row m, you use mat[m, ]. For instance, subsetting the 3rd row of mat:

mat[3, ]
[1] 31 32 33 43

Note that R shows the simplest possible data structure. Subsetting a single row or column results in a numeric vector. To see this, let’s ask for the class of mat[3, ] and use is.vector():

class(mat[3, ])
[1] "numeric"
is.vector(mat[3, ])
[1] TRUE

To preserve the structure, you need to add drop = FALSE within the subsetting operations. For instance,

mat[3, , drop = FALSE]
     [,1] [,2] [,3] [,4]
[1,]   31   32   33   43

preserves the structure of the matrix. You can see this from the result, which is now shown as a matrix, as well as from the logical tests

is.vector(mat[3, , drop = FALSE])
[1] FALSE
is.matrix(mat[3, , drop = FALSE])
[1] TRUE

In programming, adding drop = FALSE is usually a good idea as it preserves the data structure. With vectors, the subsetting operator [] preserved the structure of the vector while [[ ]] acted as the simplifying operator. With matrices, [] acts as the simplifying operator. To preserve the structure, you need to add drop = FALSE (or its abbreviation drop = F; spelling out FALSE is safer because F is an ordinary variable that can be overwritten).
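
To illustrate why drop = FALSE matters in programming, here is a small sketch. The function count_rows_above() and the matrix demo are hypothetical names made up for this illustration; the function counts the rows whose first column exceeds a threshold.

```r
# Hypothetical helper: count rows whose first column is above a threshold
count_rows_above <- function(m, threshold) {
  keep <- m[, 1] > threshold
  # drop = FALSE keeps a matrix even if only one row (or none) is selected
  nrow(m[keep, , drop = FALSE])
}

demo <- matrix(c(11, 21, 31, 41, 12, 22, 32, 42), nrow = 4, ncol = 2)

count_rows_above(demo, 35)   # one row qualifies
count_rows_above(demo, 50)   # no row qualifies

# Without drop = FALSE a single selected row becomes a vector,
# and nrow() of a vector is NULL
nrow(demo[demo[, 1] > 35, ])
```

Without drop = FALSE, the function would work for two or more matching rows but silently return NULL when exactly one row matches.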

You can subset multiple columns or rows. Suppose you need columns n to k of mat. You can subset these using mat[, n:k]. For instance, subsetting the 2nd to 4th column:

mat[, 2:4]
     [,1] [,2] [,3]
[1,]   12   13   41
[2,]   22   23   42
[3,]   32   33   43
[4,]   42   34   44

Note that in this case, the structure is preserved: the simplest possible data structure to show the result of the subsetting operation is a matrix.

Similarly, subsetting rows m to l is done using mat[m:l, ]. For instance, with m = 2 and l = 4 you subset the 2nd to 4th row:

mat[2:4, ]
     [,1] [,2] [,3] [,4]
[1,]   21   22   23   42
[2,]   31   32   33   43
[3,]   41   42   34   44

mat[m:l, n:k] subsets a range: the elements on row m to l and in columns n to k. For instance, if you need the elements in rows 2 to 4 and in columns 1 to 3:

mat[2:4, 1:3]
     [,1] [,2] [,3]
[1,]   21   22   23
[2,]   31   32   33
[3,]   41   42   34

If you need specific rows or columns that do not form a range, you can collect their indices in a vector using c(m, l, ...). For instance

  • subsetting row 1 and 3 and column 2 and 4
mat[c(1, 3), c(2, 4)]
     [,1] [,2]
[1,]   12   41
[2,]   32   43
  • subsetting all elements on row 1 and 3
mat[c(1, 3), ]
     [,1] [,2] [,3] [,4]
[1,]   11   12   13   41
[2,]   31   32   33   43
  • subsetting all elements in column 2 and 4
mat[, c(2, 4)]
     [,1] [,2]
[1,]   12   41
[2,]   22   42
[3,]   32   43
[4,]   42   44

Using negative index numbers, you tell R that you don’t want to extract those rows or columns. For instance, to show all elements in mat except those in the first row and first column, mat[-1, -1] shows:

mat[-1, -1]
     [,1] [,2] [,3]
[1,]   22   23   42
[2,]   32   33   43
[3,]   42   34   44

You can use negative indices to extract one or more rows or columns or ranges:

  • extracting all columns except columns 3 to 4
mat[, -3:-4]
     [,1] [,2]
[1,]   11   12
[2,]   21   22
[3,]   31   32
[4,]   41   42
  • extracting all rows except rows 1 to 3
mat[-1:-3, ]
[1] 41 42 34 44

Note that in this case, R simplifies the output to a vector (mat has 4 rows and you extract all except the first three, so only one row is left). Here you have an example where subsetting changes the data structure. To avoid that, you can add drop = FALSE:

mat[-1:-3, , drop = F]
     [,1] [,2] [,3] [,4]
[1,]   41   42   34   44
  • extracting all elements except those in columns 1 to 2 and rows 1 to 2
mat[-1:-2, -1:-2]
     [,1] [,2]
[1,]   33   43
[2,]   34   44

Note that you can exclude multiple columns or rows that do not form a range using -c(k, l), e.g. extracting all columns except 1 and 3:

mat[, -c(1, 3)]
     [,1] [,2]
[1,]   12   41
[2,]   22   42
[3,]   32   43
[4,]   42   44
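
A negative index is a shorthand for “all positions except these”. The sketch below spells out the same selection with positive indices using base R’s setdiff() and seq_len(); the object names are illustrative.

```r
demo <- matrix(1:12, nrow = 3, ncol = 4)

drop_cols <- c(1, 3)
keep_cols <- setdiff(seq_len(ncol(demo)), drop_cols)   # columns 2 and 4

# Both expressions select the same columns
same <- identical(demo[, -drop_cols], demo[, keep_cols])
same
```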

4.2.3.2 Subsetting a named matrix

With named matrices, you can also refer to the names of the columns and rows. Let’s add row and column names to mat:

colnames(mat) <- colnames(mat, do.NULL = FALSE, prefix = "var_")
rownames(mat) <- rownames(mat, do.NULL = FALSE, prefix = "row_")
mat
      var_1 var_2 var_3 var_4
row_1    11    12    13    41
row_2    21    22    23    42
row_3    31    32    33    43
row_4    41    42    34    44

You can now subset this matrix using mat["rowname", "columnname"]. For instance, extracting the element on row 2 and column 3:

mat["row_2", "var_3"]
[1] 23

Subsetting all elements in column var_3:

mat[, "var_3"]
row_1 row_2 row_3 row_4 
   13    23    33    34 

Note that R retains the row names in this case. However, you lose the structure of the matrix. To avoid this, add drop = F:

mat[, "var_3", drop = F]
      var_3
row_1    13
row_2    23
row_3    33
row_4    34

R also shows the names if you subset a named matrix using indices:

mat[, 3, drop = F]
      var_3
row_1    13
row_2    23
row_3    33
row_4    34

To extract all elements in row 2 and keep the structure:

mat["row_2", , drop = F]
      var_1 var_2 var_3 var_4
row_2    21    22    23    42

You can also collect the names in a vector and subset multiple rows:

mat[c("row_1", "row_3"), ]
      var_1 var_2 var_3 var_4
row_1    11    12    13    41
row_3    31    32    33    43

or multiple columns:

mat[, c("var_1", "var_3")]
      var_1 var_3
row_1    11    13
row_2    21    23
row_3    31    33
row_4    41    34

or both:

mat[c("row_1", "row_3"), c("var_1", "var_3")]
      var_1 var_3
row_1    11    13
row_3    31    33
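
Subsetting by a name that does not exist stops with an error (“subscript out of bounds”). When the names come from elsewhere in your code, it can help to check them first. The sketch below is one possible way to do so with intersect(); demo, wanted and present are illustrative names and "var_9" is a deliberately missing column.

```r
demo <- matrix(1:6, nrow = 2, ncol = 3,
               dimnames = list(c("row_1", "row_2"),
                               c("var_1", "var_2", "var_3")))

wanted  <- c("var_1", "var_9")                 # "var_9" does not exist
present <- intersect(wanted, colnames(demo))   # keep only existing names

subset_ok <- demo[, present, drop = FALSE]
subset_ok

# Asking for the missing name directly raises an error
failed <- inherits(try(demo[, wanted], silent = TRUE), "try-error")
failed
```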

4.2.3.3 Subsetting using a logical matrix

Recall that you can subset a vector using a logical vector. For a matrix, this is also true. However, in this case, the result is not a matrix but a vector. This vector includes all elements for which the condition returned TRUE. Let’s create a random logical matrix:

cond = matrix(sample(c(TRUE, FALSE), 16, TRUE), nrow = 4, ncol = 4)
cond
      [,1]  [,2]  [,3]  [,4]
[1,] FALSE  TRUE FALSE FALSE
[2,]  TRUE FALSE FALSE FALSE
[3,] FALSE  TRUE  TRUE FALSE
[4,]  TRUE  TRUE FALSE FALSE

Here, we have a matrix whose elements are either TRUE or FALSE. We can use this matrix to extract the values in mat that are in the same position as the value TRUE in the matrix cond. To do so, we can use:

mat[cond]
[1] 21 41 12 32 42 33

Within the [] you can include various conditions. For instance, to extract all elements in mat larger than 25, you can use

mat[mat > 25]
 [1] 31 41 32 42 33 34 41 42 43 44
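
Because logical subsetting returns a plain vector, the positions of the matching elements are lost. If you need them, base R’s which() with arr.ind = TRUE returns the row and column index of every TRUE. A minimal sketch with illustrative object names:

```r
demo <- matrix(c(11, 21, 12, 22), nrow = 2, ncol = 2)

hits <- demo > 20    # logical matrix
demo[hits]           # matching values, returned as a vector

# One line per TRUE, with columns named "row" and "col"
pos <- which(hits, arr.ind = TRUE)
pos
```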

You can further refine the condition and apply it to only one column or one row. For instance, to extract all rows whose value in the first column is larger than 25, you can define this condition:

cond <- mat[, 1] > 25
cond
row_1 row_2 row_3 row_4 
FALSE FALSE  TRUE  TRUE 

If you include this condition in the subsetting operator for the rows, you’ll see all columns for the rows whose value in the first column is larger than 25:

mat[cond, ]
      var_1 var_2 var_3 var_4
row_3    31    32    33    43
row_4    41    42    34    44

Collecting values in a vector allows you to verify whether they are also elements of the matrix. For instance, extracting the elements in mat that are equal to 12, 22, 33, 44 or 55 is done using

mat[mat %in% c(12, 22, 33, 44, 55)]
[1] 12 22 33 44
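
Because a logical subset returns a plain vector, the row and column positions of the extracted values are lost. The sketch below shows one way to recover them with which() and its arr.ind argument. The matrix is rebuilt here from the printed output of mat, which is an assumption about how it was originally created.

```r
# Matrix reconstructed from the printed output of mat (assumed construction)
mat <- matrix(
  c(11, 21, 31, 41,
    12, 22, 32, 42,
    13, 23, 33, 34,
    41, 42, 43, 44),
  nrow = 4,
  dimnames = list(paste0("row_", 1:4), paste0("var_", 1:4))
)

# A logical subset returns the values only, in column order
vals <- mat[mat > 40]
vals

# which() with arr.ind = TRUE returns the row and column index of every TRUE
pos <- which(mat > 40, arr.ind = TRUE)
pos

# Subsetting with this two-column position matrix returns the same values
mat[pos]
```

Each row of pos holds one (row, column) pair, so you can see where in mat every extracted value comes from.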

Subsetting using logical conditions also allows you to subset a named matrix using regular expressions. Recall that the grepl() function outputs a logical vector. If a matrix has column or row names, you can use these in grepl() to extract observations (rows) or variables (columns) that match a regular expression. Suppose that you want to extract all observations on row_1, row_2 and row_3. Here, a simple regular expression would be “_[0-3]”. This regular expression matches all row names that include “_” followed by one digit equal to 0, 1, 2 or 3. grepl() needs to find matches in the row names of mat:

grepl(pattern = "_[0-3]", x = rownames(mat))
[1]  TRUE  TRUE  TRUE FALSE

We can now use this expression to extract the observations. To do so, you can either create a vector cond to store the result of grepl(), which you can then use to subset:

cond <- grepl(pattern = "_[0-3]", x = rownames(mat))
mat[cond, ]
      var_1 var_2 var_3 var_4
row_1    11    12    13    41
row_2    21    22    23    42
row_3    31    32    33    43

As an alternative, you can use grepl() directly in the subsetting operation:

mat[grepl(pattern = "_[0-3]", x = rownames(mat)), ]
      var_1 var_2 var_3 var_4
row_1    11    12    13    41
row_2    21    22    23    42
row_3    31    32    33    43

This last method is less likely to result in code that is easy to read. In other words, if the pattern is complex, it is in general a good idea to use the first method.

You can do the same with column names. For instance, extracting all variables var_2, var_3 and var_4 can be done through:

mat[, grepl(pattern = "_[2-4]", x = colnames(mat))]
      var_2 var_3 var_4
row_1    12    13    41
row_2    22    23    42
row_3    32    33    43
row_4    42    34    44

Combining both conditions subsets rows as well as columns:

mat[grepl(pattern = "_[0-3]", x = rownames(mat)), grepl(pattern = "_[2-4]", x = colnames(mat))]
      var_2 var_3 var_4
row_1    12    13    41
row_2    22    23    42
row_3    32    33    43

The subset(x, subset, select, drop = FALSE, ...) function allows you to extract the columns defined in select from the matrix x for the rows where the logical index defined in subset is TRUE. Using this function, you can subset rows and select which columns R needs to return. For instance, to select columns 1 and 4 of mat for the rows where the value in column 2 is larger than 20 (mat[, 2] > 20), you would use:

subset(mat, subset = mat[, 2] > 20, select = c(1, 4))
      var_1 var_4
row_2    21    42
row_3    31    43
row_4    41    44

In the select argument, you can use the usual subsetting methods:

  • extracting all columns between 2 and 4
subset(mat, subset = mat[, 2] > 20, select = 2:4)
      var_2 var_3 var_4
row_2    22    23    42
row_3    32    33    43
row_4    42    34    44
  • extracting all but column 4:
subset(mat, subset = mat[, 2] > 20, select = -4)
      var_1 var_2 var_3
row_2    21    22    23
row_3    31    32    33
row_4    41    42    34

Using the row names of the matrix, you can also use grepl(). For instance, selecting columns 1 and 4 and only rows whose name includes “3” or “4” uses:

cond <- grepl(pattern = "row_[3-4]", rownames(mat))
subset(mat, cond, c(1, 4))
      var_1 var_4
row_3    31    43
row_4    41    44

If you don’t use select =, R returns all columns by default. Omitting the subset argument returns all rows of the selected columns.
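
To connect subset() with the bracket notation used earlier, the sketch below compares both forms on the same selection. It also shows that select accepts unquoted column names when the matrix has them. As before, mat is rebuilt from its printed output, which is an assumption.

```r
# Matrix reconstructed from the printed output of mat (assumed construction)
mat <- matrix(
  c(11, 21, 31, 41,
    12, 22, 32, 42,
    13, 23, 33, 34,
    41, 42, 43, 44),
  nrow = 4,
  dimnames = list(paste0("row_", 1:4), paste0("var_", 1:4))
)

# The same selection written with subset() and with []
by_subset  <- subset(mat, subset = mat[, 2] > 20, select = c(1, 4))
by_bracket <- mat[mat[, 2] > 20, c(1, 4), drop = FALSE]
all.equal(by_subset, by_bracket)

# select also accepts unquoted column names
by_name <- subset(mat, subset = mat[, 2] > 20, select = c(var_1, var_4))
by_name
```

Note that subset() uses drop = FALSE by default, so the result stays a matrix even if only one row or column is selected.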

4.2.3.4 Diagonal and lower and upper triangular parts

If you have a square matrix, you can extract the diagonal elements using diag(x). Extracting the diagonal elements from mat:

diag(mat, names = TRUE)
[1] 11 22 33 44

To extract the upper or lower triangular part of a square matrix, there are two functions: upper.tri(x, diag = FALSE) and lower.tri(x, diag = FALSE). The first identifies the upper triangular part, the second the lower triangular part. Both take the square matrix as the first argument. The second argument determines if the diagonal is included or not (default). The outcome is a logical matrix that can be used to subset the matrix.

uptri <- upper.tri(mat, diag = FALSE)
uptri
      [,1]  [,2]  [,3]  [,4]
[1,] FALSE  TRUE  TRUE  TRUE
[2,] FALSE FALSE  TRUE  TRUE
[3,] FALSE FALSE FALSE  TRUE
[4,] FALSE FALSE FALSE FALSE

Using this logical matrix to subset mat:

mat[uptri]
[1] 12 13 23 41 42 43

Extracting the lower triangular part can be done in a similar way. If we include the diagonal:

lotri <- lower.tri(mat, diag = TRUE)
mat[lotri]
 [1] 11 21 31 41 22 32 42 33 34 44
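
To see what these logical matrices encode, the sketch below compares lower.tri() and upper.tri() with a condition on the row and column index of every element, using the base functions row() and col(). The matrix is rebuilt from the printed output of mat (an assumption).

```r
# Matrix reconstructed from the printed output of mat (assumed construction)
mat <- matrix(
  c(11, 21, 31, 41,
    12, 22, 32, 42,
    13, 23, 33, 34,
    41, 42, 43, 44),
  nrow = 4,
  dimnames = list(paste0("row_", 1:4), paste0("var_", 1:4))
)

# Below the diagonal, the row index is larger than the column index
lo <- lower.tri(mat, diag = FALSE)
all(lo == (row(mat) > col(mat)))

# Including the diagonal changes the strict comparison into <= (or >=)
up_diag <- upper.tri(mat, diag = TRUE)
all(up_diag == (row(mat) <= col(mat)))

# Values strictly below the diagonal, in column order
mat[lo]
```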

Create a 3x3 matrix, mat_0, filled with the sequence from 101 to 109.

mat_0 <- matrix(101:109, 3, 3)

Using this matrix, extract

  • element in row 2 and column 3:
mat_0[2, 3]
[1] 108
  • all values in column 2 and preserve the structure of the matrix in the result
mat_0[, 2, drop = FALSE]
     [,1]
[1,]  104
[2,]  105
[3,]  106
  • all values in row 1 and preserve the structure of the matrix in the result
mat_0[1, , drop = FALSE]
     [,1] [,2] [,3]
[1,]  101  104  107
  • all values except those in column 1:
mat_0[, -1]
     [,1] [,2]
[1,]  104  107
[2,]  105  108
[3,]  106  109
  • all values in columns 1 and 3:
mat_0[, c(1, 3)]
     [,1] [,2]
[1,]  101  107
[2,]  102  108
[3,]  103  109
  • all values in rows 1 and 3:
mat_0[c(1, 3), ]
     [,1] [,2] [,3]
[1,]  101  104  107
[2,]  103  106  109

Let’s now add names to mat_0: “row_1”, … for the rows and “var_1”, … for the columns:

colnames(mat_0) <- colnames(mat_0, do.NULL = FALSE, prefix = "var_")
rownames(mat_0) <- rownames(mat_0, do.NULL = FALSE, prefix = "row_")

Using these names, extract

  • all values for var_1 preserving the matrix structure of the result:
mat_0[, "var_1", drop = FALSE]
      var_1
row_1   101
row_2   102
row_3   103
  • all values for row_1 and row_3:
mat_0[c("row_1", "row_3"), ]
      var_1 var_2 var_3
row_1   101   104   107
row_3   103   106   109

Extract all the values larger than 104:

mat_0[mat_0 > 104]
[1] 105 106 107 108 109

Which values are on the diagonal of mat_0?

diag(mat_0)
[1] 101 105 109

Extract the upper triangular part of mat_0 excluding the diagonal.

mat_0[upper.tri(mat_0, diag = FALSE)]
[1] 104 107 108

4.2.4 Changing elements in a matrix

You can change individual elements of a matrix by assigning them a new value. Suppose you want to change the value in row 2 and column 3 of mat from 23 into 123. You can use

mat[2, 3] <- 123
mat
      var_1 var_2 var_3 var_4
row_1    11    12    13    41
row_2    21    22   123    42
row_3    31    32    33    43
row_4    41    42    34    44

If you want to change all values less than 25 into 0, you can subset using this condition and reassign the values of the elements where the condition is TRUE:

mat[mat < 25] <- 0
mat
      var_1 var_2 var_3 var_4
row_1     0     0     0    41
row_2     0     0   123    42
row_3    31    32    33    43
row_4    41    42    34    44
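
The condition does not have to cover the whole matrix. As a minimal sketch, the example below changes values in one column based on a condition on another column, and then overwrites a full row. It works on a copy, mat_new (an illustrative name), of the original values of mat, rebuilt here from the printed output (an assumption).

```r
# Original values of mat, reconstructed from its printed output (assumed construction)
mat <- matrix(
  c(11, 21, 31, 41,
    12, 22, 32, 42,
    13, 23, 33, 34,
    41, 42, 43, 44),
  nrow = 4,
  dimnames = list(paste0("row_", 1:4), paste0("var_", 1:4))
)

# Work on a copy so that mat itself stays unchanged
mat_new <- mat

# Set var_4 to NA in the rows where var_1 is larger than 25
mat_new[mat_new[, "var_1"] > 25, "var_4"] <- NA

# Overwrite a complete row: the single value 0 is recycled along the row
mat_new["row_1", ] <- 0
mat_new
```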

4.2.5 Changing dimensions of a matrix

4.2.5.1 Changing the number of rows or columns

Suppose that you have a matrix, mat with 6 rows and 20 columns:

mat <- matrix(1:120, 6, 20)

You can change the dimensions of this matrix using dim(). For instance, if you want to change this matrix into a 3x40 matrix:

dim(mat) <- c(3, 40)

Note that you can do this as long as the length of the matrix is unaffected. In other words, the number of elements in both matrices must be the same.
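
The sketch below illustrates both points: the elements keep their column order when the dimensions change, and R refuses new dimensions that do not match the number of elements. The tryCatch() call is only there to capture the error message instead of stopping the script.

```r
mat <- matrix(1:120, 6, 20)

# Before reshaping, the first column holds the values 1 to 6
mat[, 1]

# After reshaping to 3x40, the same six values fill columns 1 and 2
dim(mat) <- c(3, 40)
mat[, 1:2]

# 7 * 20 = 140 elements does not match the 120 elements in mat
res <- tryCatch(
  {
    dim(mat) <- c(7, 20)
    "ok"
  },
  error = function(e) conditionMessage(e)
)
res

# The failed assignment leaves the dimensions untouched
dim(mat)
```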

4.2.5.2 Adding row/columns to matrices

Using the rbind() (row bind) and cbind() (column bind) functions you can combine vectors and matrices. The first, rbind(), combines by rows: it stacks one vector or matrix on top of the other. To do so, the vectors and matrices that will be combined need to have the same number of columns. cbind() places the vectors and matrices next to each other. The vectors and matrices in this function need to have the same number of rows.

Suppose that you have two matrices, mat_1 (filled with 1’s) and mat_2 (filled with 2’s):

mat_1 <- matrix(1, nrow = 3, ncol = 2)
mat_1
     [,1] [,2]
[1,]    1    1
[2,]    1    1
[3,]    1    1
mat_2 <- matrix(2, nrow = 3, ncol = 1)
mat_2
     [,1]
[1,]    2
[2,]    2
[3,]    2

They both have the same number of rows. That means that you can bind both and add the columns of mat_2 to those of mat_1:

mat_c12 <- cbind(mat_1, mat_2)
mat_c12
     [,1] [,2] [,3]
[1,]    1    1    2
[2,]    1    1    2
[3,]    1    1    2

Note that cbind(mat_2, mat_1) would add the columns of mat_1 to those of mat_2:

mat_c21 <- cbind(mat_2, mat_1)
mat_c21
     [,1] [,2] [,3]
[1,]    2    1    1
[2,]    2    1    1
[3,]    2    1    1

If you have a matrix mat_3 (filled with 3’s) with the same number of columns as e.g. mat_1, you can add its rows to those of mat_1 using rbind():

mat_3 = matrix(3, nrow = 2, ncol = 2)

mat_r13 <- rbind(mat_1, mat_3)
mat_r13
     [,1] [,2]
[1,]    1    1
[2,]    1    1
[3,]    1    1
[4,]    3    3
[5,]    3    3

If you reverse the order, you add the rows of mat_1 to those of mat_3:

mat_r31 <- rbind(mat_3, mat_1)
mat_r31
     [,1] [,2]
[1,]    3    3
[2,]    3    3
[3,]    1    1
[4,]    1    1
[5,]    1    1
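
The examples above bind matrices to matrices. As a small sketch, the code below binds vectors to a matrix and shows what happens when the dimensions do not match. The object names with_col, with_row and with_const are illustrative, and tryCatch() is used only to capture the error message.

```r
mat_1 <- matrix(1, nrow = 3, ncol = 2)
mat_3 <- matrix(3, nrow = 2, ncol = 2)

# A vector with one value per row becomes an extra column
with_col <- cbind(mat_1, c(7, 8, 9))
with_col

# A vector with one value per column becomes an extra row
with_row <- rbind(mat_1, c(7, 8))
with_row

# A single value is recycled to fill the whole new column
with_const <- cbind(mat_1, 0)
with_const

# mat_1 has 3 rows and mat_3 has 2 rows, so cbind() fails
msg <- tryCatch(cbind(mat_1, mat_3), error = function(e) conditionMessage(e))
msg
```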

4.2.5.3 Removing rows and columns

When we discussed subsetting a matrix, we introduced negative index positions to select all but the rows/columns with a negative index. This is the first approach if you want to remove a row or a column. Suppose for instance that you want to remove the last two rows of mat_r31. You use their negative index positions and save the result as mat_r31. Note that you can specify the negative positions using a range, or you can collect the positions in a vector c() and put a minus sign in front of it. Here, we will use the last approach:

mat_r31 <- mat_r31[-c(4, 5), ]
mat_r31
     [,1] [,2]
[1,]    3    3
[2,]    3    3
[3,]    1    1

You can remove the first two columns from mat_c12 in a similar way. Because only one column remains, R simplifies the result to a vector:

mat_c12 <- mat_c12[, -1:-2]
mat_c12
[1] 2 2 2
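
If the result needs to remain a matrix, you can add drop = FALSE, as we did when subsetting. A minimal sketch, where mat_c12 is rebuilt as in the example above:

```r
# mat_c12 as created above: two columns of 1's and one column of 2's
mat_c12 <- cbind(matrix(1, nrow = 3, ncol = 2), matrix(2, nrow = 3, ncol = 1))

# Default behaviour: a single remaining column is simplified to a vector
as_vector <- mat_c12[, -1:-2]
is.matrix(as_vector)

# With drop = FALSE the result keeps its matrix structure (3 rows, 1 column)
as_matrix <- mat_c12[, -1:-2, drop = FALSE]
dim(as_matrix)
```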

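The second approach uses a logical vector where a value TRUE will keep the row or column and a value FALSE will remove that row or column. To illustrate this approach, we’ll use mat. Suppose you want to remove columns 1 and 3 and you use the logical vector c(FALSE, TRUE, FALSE, TRUE). Recall that mat now has 40 columns. Because the logical vector has only 4 elements, R recycles it across all 40 columns, so subsetting with this vector drops every odd-numbered column and keeps every even-numbered one: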

mat_keep <- c(FALSE, TRUE, FALSE, TRUE)
mat[, mat_keep]
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,]    4   10   16   22   28   34   40   46   52    58    64    70    76    82
[2,]    5   11   17   23   29   35   41   47   53    59    65    71    77    83
[3,]    6   12   18   24   30   36   42   48   54    60    66    72    78    84
     [,15] [,16] [,17] [,18] [,19] [,20]
[1,]    88    94   100   106   112   118
[2,]    89    95   101   107   113   119
[3,]    90    96   102   108   114   120

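If you reassign this result to mat, you have effectively removed columns 1, 3 and every other odd-numbered column, because the 4 element logical vector is recycled along the 40 columns of mat. To remove only columns 1 and 3, the logical vector needs one element per column. You can use a similar approach to keep/remove rows.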

Note that you don’t need to write the logical vector by hand. Usually, this vector will be the outcome of a condition.
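
As an illustration of a logical vector that comes from a condition, the sketch below keeps only the columns whose sum exceeds 300. The threshold and the names keep and mat_small are arbitrary choices for this example.

```r
# The 3x40 matrix used above
mat <- matrix(1:120, 3, 40)

# One TRUE/FALSE per column: is the column sum larger than 300?
keep <- colSums(mat) > 300
length(keep)
sum(keep)

# Keep only the columns where the condition holds
mat_small <- mat[, keep]
mat_small
```

Since keep has exactly one element per column, no recycling takes place here.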

4.2.5.4 Deconstructing a matrix

Deconstructing a matrix refers to the operation where you drop the dimensions and change the matrix into a vector. To do so, you can use the c() function. For instance, applying this function to mat results in a vector.

c(mat)
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
[109] 109 110 111 112 113 114 115 116 117 118 119 120

As you can see, the vector starts with the first column, then adds the second column, the third, and so on.
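
If you need the elements row by row instead, one option is to transpose the matrix with t() before deconstructing it. The sketch below uses a small illustrative matrix, here called small, and also shows as.vector() as an alternative to c().

```r
# A small 2x3 matrix: the rows are (1, 3, 5) and (2, 4, 6)
small <- matrix(1:6, nrow = 2, ncol = 3)
small

# Column by column (the default order)
c(small)

# as.vector() gives the same result
as.vector(small)

# Row by row: transpose first, then deconstruct
c(t(small))
```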

Using the next three matrices:

mat_1 <- matrix(1:12, 3, 4)
mat_2 <- matrix(11:22, 3, 4)
mat_3 <- matrix(31:42, 3, 4)
  • add columns of mat_3 to those of mat_1:
cbind(mat_1, mat_3)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    1    4    7   10   31   34   37   40
[2,]    2    5    8   11   32   35   38   41
[3,]    3    6    9   12   33   36   39   42
  • add the rows of mat_2 to those of mat_1:
rbind(mat_1, mat_2)
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
[4,]   11   14   17   20
[5,]   12   15   18   21
[6,]   13   16   19   22
  • remove the 2nd row of mat_2
mat_2[-2, ]
     [,1] [,2] [,3] [,4]
[1,]   11   14   17   20
[2,]   13   16   19   22
  • remove the 2nd row of mat_2 using a logical vector
mat_2[c(T, F, T), ]
     [,1] [,2] [,3] [,4]
[1,]   11   14   17   20
[2,]   13   16   19   22
  • remove column 1 and 3 from mat_1. Do so in three ways:

  • Option 1:

mat_1[, -c(1, 3)]
     [,1] [,2]
[1,]    4   10
[2,]    5   11
[3,]    6   12
  • Option 2:
mat_1[, c(-1, -3)]
     [,1] [,2]
[1,]    4   10
[2,]    5   11
[3,]    6   12
  • Option 3:
mat_1[, c(F, T, F, T)]
     [,1] [,2]
[1,]    4   10
[2,]    5   11
[3,]    6   12

Change the values of mat_1 on the upper triangular part, excluding the diagonal, to 0.

mat_1[upper.tri(mat_1, diag = FALSE)] <- 0
mat_1
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    2    5    0    0
[3,]    3    6    9    0

Turn mat_3 into a vector.

c(mat_3)
 [1] 31 32 33 34 35 36 37 38 39 40 41 42

4.2.6 Applying functions to a matrix

Recall that many functions in R are vectorized. For matrices, that means that they apply to the individual elements of a matrix. This holds for most operators: they work on an element by element basis. Note, however, that for an operator applied to two matrices, this requires that the dimensions of both matrices are the same.

4.2.6.1 Operators

Operators such as addition, subtraction, division, multiplication, integer division or modulus can be used with matrices. Using

mat_1 <- matrix(c(2, 4, 8, 10), 2, 2)
mat_2 <- matrix(c(1, 2, 3, 4), 2, 2)

we’ll illustrate these operators.

  • addition:
mat_1 + mat_2
     [,1] [,2]
[1,]    3   11
[2,]    6   14
  • subtraction:
mat_1 - mat_2
     [,1] [,2]
[1,]    1    5
[2,]    2    6
  • multiplication:
mat_1 * mat_2
     [,1] [,2]
[1,]    2   24
[2,]    8   40
  • division
mat_1 / mat_2
     [,1]     [,2]
[1,]    2 2.666667
[2,]    2 2.500000
  • integer division
mat_1 %/% mat_2
     [,1] [,2]
[1,]    2    2
[2,]    2    2
  • modulus
mat_1 %% mat_2
     [,1] [,2]
[1,]    0    2
[2,]    0    2
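
The requirement that both matrices have the same dimensions can be checked with a small sketch. Here, wide is an illustrative 2x3 matrix that does not match the 2x2 matrix mat_1, and tryCatch() is used only to capture the error message. A single number, on the other hand, is recycled and combined with every element.

```r
mat_1 <- matrix(c(2, 4, 8, 10), 2, 2)
wide  <- matrix(1:6, 2, 3)

# A single number is combined with every element of the matrix
plus_one <- mat_1 + 1
plus_one

# A 2x2 and a 2x3 matrix cannot be added element by element
msg <- tryCatch(mat_1 + wide, error = function(e) conditionMessage(e))
msg
```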

4.2.6.2 Functions

Most functions are also vectorized. In other words, they work on an element by element basis if applied to a matrix. For instance

  • absolute value
abs(-1 * mat_1)
     [,1] [,2]
[1,]    2    8
[2,]    4   10
  • natural logarithm:
log(mat_1)
          [,1]     [,2]
[1,] 0.6931472 2.079442
[2,] 1.3862944 2.302585
  • power, e.g. with exponent 3:
mat_1^3
     [,1] [,2]
[1,]    8  512
[2,]   64 1000
  • square root
sqrt(mat_1)
         [,1]     [,2]
[1,] 1.414214 2.828427
[2,] 2.000000 3.162278
  • exponential function
exp(mat_1)
          [,1]      [,2]
[1,]  7.389056  2980.958
[2,] 54.598150 22026.466

Here, the functions are applied to all elements of the matrix mat_1. Note that this is not necessary. If you subset the columns or rows of mat_1, R will apply a function only to the extracted rows or columns. This also holds for the mathematical operators. For instance:

  • adding the first column of mat_1 to the second column of mat_2:
mat_1[, 1] + mat_2[, 2]
[1] 5 8
  • natural logarithm of the first row of mat_1
log(mat_1[1, ])
[1] 0.6931472 2.0794415

4.2.6.3 Statistical functions

We covered a number of statistical functions and discussed how they are applied to vectors. By extension, you can use these for matrices too. Subsetting a column, row or element allows you to apply these functions to all elements in one or multiple rows, columns, elements or ranges. For instance,

  • area under the Student-t density with 5 degrees of freedom to the left of the values in the first column of mat_2:
pt(mat_2[, 1], df = 5)
[1] 0.8183913 0.9490303
  • the density of an F distribution with 6 and 2 degrees of freedom, evaluated at the values in the first and the second row of mat_1:
df(mat_1[1:2, ], df1 = 6, df2 = 2)
           [,1]        [,2]
[1,] 0.13494377 0.013271040
[2,] 0.04537656 0.008770781
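
The functions starting with d, such as df(), return the height of the density, while those starting with p, such as pf() and pt(), return the area under the density to the left of a value. The sketch below checks this numerically for the F distribution used above, at the illustrative value x = 2, with the base function integrate().

```r
x <- 2

# Height of the F(6, 2) density at x
dens <- df(x, df1 = 6, df2 = 2)
dens

# Cumulative probability: area under the density between 0 and x
prob <- pf(x, df1 = 6, df2 = 2)
prob

# Numerical integration of the density over the same interval
area <- integrate(df, lower = 0, upper = x, df1 = 6, df2 = 2)$value
all.equal(prob, area, tolerance = 1e-4)
```

Like df() above, pf() can also be applied to a matrix or a subset of a matrix, for instance pf(mat_1[1:2, ], df1 = 6, df2 = 2).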

4.2.6.4 Functions on multiple columns/rows

In the previous section, we applied all functions to all elements of the matrix or a subset of columns and/or rows. R includes a number of functions that you can apply to every column or every row. In addition, you can use the apply() function to apply a function per row or per column.

4.2.6.4.1 Functions operating per row/column

R includes functions that work per row or column of a matrix. To illustrate some of these functions, we’ll use the following random matrices:

n = 1000
m = 5
matr <- matrix(rnorm(n * m), n, m)
matu <- matrix(runif(n * m, min = 0, max = 1), n, m)
colnames(matr) <- colnames(matr, do.NULL = FALSE, prefix = "var_")
colnames(matu) <- colnames(matu, do.NULL = FALSE, prefix = "var_")
rownames(matr) <- rownames(matr, do.NULL = FALSE, prefix = "obs_")
rownames(matu) <- rownames(matu, do.NULL = FALSE, prefix = "obs_")

The mean of every column of the first matrix, matr, should be (close to) zero and its standard deviation (close to) one. For the second matrix, where each element is drawn from a uniform distribution with minimum zero and maximum one, the sum of each column should be close to 500 (the number of observations multiplied by the expected value 0.5), the minimum should be close to zero and the maximum should be (close to) one.

R includes functions to calculate the means per column or per row: colMeans(x, na.rm = FALSE, dims = 1) and rowMeans(x, na.rm = FALSE, dims = 1). In both functions, x refers to the matrix. If your matrix includes missing values, you need to change the second argument from FALSE to TRUE. The argument dims = 1 allows you to specify which dimensions are regarded as rows or columns. You can leave this at its default value. Using these functions, you can calculate the mean per column:

colMeans(matr, na.rm = TRUE)
       var_1        var_2        var_3        var_4        var_5 
 0.005143367 -0.049817391 -0.001946084  0.020798152  0.045079932 

or the mean per row: rowMeans(matr, na.rm = TRUE). In this case, given the size of the matrix, the output would be very long.

Note that you can reduce the number of columns by subsetting the matrix. For instance, to determine the column means of columns var_2, var_3 and var_4, you can use the grepl() function to subset these columns:

colMeans(matr[, grepl(pattern = "_[2-4]", x = colnames(matr))], na.rm = TRUE)
       var_2        var_3        var_4 
-0.049817391 -0.001946084  0.020798152 

To calculate the sum of all values in a column or row, R includes colSums(x, na.rm = FALSE) and rowSums(x, na.rm = FALSE). As with colMeans(), the first argument is the matrix, while the second allows you to specify whether missing values should be disregarded in the calculation. Using colSums() to calculate the sum of all values per column in matu:

colSums(matu, na.rm = TRUE )
   var_1    var_2    var_3    var_4    var_5 
498.8963 494.7191 510.6848 491.3499 488.0277 
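
For the row versions, where the full output would be very long, you can look at the first few values with head(). The sketch below does this and also checks how column sums and column means are related. The seed is an assumption added to make the example reproducible, so the numbers differ from those shown above.

```r
set.seed(42)  # assumed seed, not part of the original example
n <- 1000
m <- 5
matr <- matrix(rnorm(n * m), n, m)

# One mean per row, so 1000 values in total
row_means <- rowMeans(matr)
length(row_means)

# Show only the first three row means
head(row_means, 3)

# A column mean is the column sum divided by the number of rows
all.equal(colMeans(matr), colSums(matr) / nrow(matr))
```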

You can use scale() to standardize the values per column. Recall that a standardized value is calculated as

\[ x_{stand} = {{(x - \overline{x})} \over{s}} \]

where \(\overline{x}\) is the mean and \(s\) is the standard deviation.

To illustrate this function, we’ll redefine matr and determine its values as draws from a normal distribution with mean 5 and standard deviation 10:

matr <- matrix(rnorm(n * m, 5, 10), n, m)

If you check the column means, you’ll see that they are (close to) 5:

colMeans(matr)
[1] 5.300783 5.116847 5.083844 5.348894 4.590922

To standardize these values, you can use scale(x, center = TRUE, scale = TRUE). The first argument is the matrix you want to scale. The second and third arguments determine if you want to subtract the mean (i.e. you want to center the columns in matr) and divide by the standard deviation of every column in matr. By default, both are the case. Applying that function to matr:

matrs <- scale(matr, center = TRUE, scale = TRUE)

If you now look at the means per column,

colMeans(matrs)
[1] 1.811398e-17 6.473988e-18 1.907155e-17 2.709898e-17 4.207051e-17

you can verify that they are (close to) zero. The standard deviation is also (close to) one:

sd(matrs[, 1])
[1] 1
sd(matrs[, 2])
[1] 1
sd(matrs[, 3])
[1] 1
sd(matrs[, 4])
[1] 1
sd(matrs[, 5])
[1] 1
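
To make the link with the formula for a standardized value explicit, the sketch below performs the two steps by hand with the base function sweep() and compares the result with scale(). It also computes all five standard deviations in one apply() call. The seed and the names centered and manual are assumptions for this illustration.

```r
set.seed(123)  # assumed seed, not part of the original example
n <- 1000
m <- 5
matr <- matrix(rnorm(n * m, 5, 10), n, m)

# Step 1: subtract the column mean from every value in that column
centered <- sweep(matr, MARGIN = 2, STATS = colMeans(matr), FUN = "-")

# Step 2: divide every column by its standard deviation
manual <- sweep(centered, MARGIN = 2, STATS = apply(matr, 2, sd), FUN = "/")

# scale() stores the means and standard deviations as extra attributes,
# so only the values are compared here
matrs <- scale(matr, center = TRUE, scale = TRUE)
all.equal(manual, matrs, check.attributes = FALSE)

# Standard deviation of every standardized column in one call
apply(matrs, MARGIN = 2, FUN = sd)
```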

If you set one of the arguments, center or scale, to FALSE, R will not center (i.e. will not subtract the means) or will not scale (i.e. will not divide by the standard deviation). In addition, you can supply your own vector that R will use to center or to scale. Suppose that you don’t want to center, but want to scale by the sum of all values in a column. You can use:

matrsum <- scale(matr, center = FALSE, scale = colSums(matr))

Using colSums(), you can verify this result:

colSums(matrsum)
[1] 1 1 1 1 1

4.2.6.4.2 The apply() function

The apply() function allows you to apply any function to each separate row or column of a matrix. Recall that we used this function in Chapter 2. The function includes a number of arguments: apply(X, MARGIN, FUN, ..., simplify = TRUE). The first, X, refers to the matrix. The second, MARGIN, determines if the function is applied to every row (MARGIN = 1) or to every column (MARGIN = 2). With MARGIN = c(1, 2), the function is applied over both dimensions, i.e. to every individual element. FUN refers to the function you want to apply to every row or column. The three dots ... refer to optional arguments for FUN. For instance, you can add na.rm = TRUE as an optional argument if FUN = sd. The last argument tells R to simplify the output if possible. For matrices, apply() simplifies to a vector or array. In case simplify = FALSE, R will return a list. A list is a data structure that we will discuss later in this chapter.
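
A small matrix makes the role of MARGIN and simplify easier to follow. The sketch below uses an illustrative 2x3 matrix m. Note that the simplify argument requires R 4.1.0 or later.

```r
# A small 2x3 matrix: the rows are (1, 3, 5) and (2, 4, 6)
m <- matrix(1:6, nrow = 2, ncol = 3)

# MARGIN = 1: one result per row
row_sums <- apply(m, MARGIN = 1, FUN = sum)
row_sums

# MARGIN = 2: one result per column
col_sums <- apply(m, MARGIN = 2, FUN = sum)
col_sums

# MARGIN = c(1, 2): the function is applied to every element
times_ten <- apply(m, MARGIN = c(1, 2), FUN = function(v) v * 10)
times_ten

# simplify = FALSE: the results are returned as a list, one element per column
as_list <- apply(m, MARGIN = 2, FUN = range, simplify = FALSE)
as_list
```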

This function allows you to avoid writing explicit for loops, although it is not always possible to avoid them. A call to apply() is usually more compact and less error-prone than the equivalent loop. Note that apply() itself loops over the rows or columns internally, so the difference in speed with a well-written for loop is typically small. If a dedicated function such as colMeans() or colSums() exists, that function is usually the most efficient option, especially for larger datasets.

Let’s use a couple of examples to illustrate how you can use apply(). Here, we will apply a function to the columns. For rows, the output would be too long. Note that here too, you can subset the matrix you include in the first argument. Let’s start with the two functions we already met: the mean and sum of the columns. Using apply(), you can calculate the mean of every column in matr:

apply(matr, MARGIN = 2, FUN = mean, na.rm = TRUE, simplify = TRUE)
[1] 5.300783 5.116847 5.083844 5.348894 4.590922

For the sum of matu:

apply(matu, MARGIN = 2, FUN = sum, simplify = TRUE)
   var_1    var_2    var_3    var_4    var_5 
498.8963 494.7191 510.6848 491.3499 488.0277 

Using a for loop would require you to write

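# Pre-allocate a 1x5 matrix to store the mean of every column
mat_mean <- matrix(0, 1, 5)
# Loop over the columns and store each column mean
for (i in 1:5) {
  mat_mean[i] <- mean(matr[, i])
}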
mat_mean
         [,1]     [,2]     [,3]     [,4]     [,5]
[1,] 5.300783 5.116847 5.083844 5.348894 4.590922
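
As a check, the sketch below computes the column means in three ways (a for loop, apply() and colMeans()) and verifies that the results agree. The seed and the object names are assumptions for this illustration.

```r
set.seed(123)  # assumed seed, not part of the original example
matr <- matrix(rnorm(1000 * 5, 5, 10), 1000, 5)

# 1. for loop with a pre-allocated result vector
loop_means <- numeric(ncol(matr))
for (i in seq_len(ncol(matr))) {
  loop_means[i] <- mean(matr[, i])
}

# 2. apply() over the columns
apply_means <- apply(matr, MARGIN = 2, FUN = mean)

# 3. the dedicated function
direct_means <- colMeans(matr)

# All three approaches give the same column means
all.equal(loop_means, apply_means)
all.equal(apply_means, direct_means)
```

Using seq_len(ncol(matr)) instead of 1:5 lets the loop adapt to the number of columns in the matrix.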

The argument for FUN can include most functions that we have seen so far. A couple of examples to illustrate:

  • the quantiles for every column of matr
apply(matr, MARGIN = 2, FUN = quantile, simplify = TRUE)
           [,1]       [,2]       [,3]       [,4]       [,5]
0%   -21.249655 -23.997095 -29.775836 -26.053802 -23.360434
25%   -1.191518  -1.251438  -1.893282  -1.246803  -2.013940
50%    4.960671   5.078630   5.033694   5.362717   4.899435
75%   11.814882  11.494527  11.955620  11.887836  10.981581
100%  37.325787  30.820108  38.755552  38.062730  35.365880
  • the standard deviation for every column of matr:
apply(matr, MARGIN = 2, FUN = sd, simplify = TRUE)
[1]  9.513057  9.520386 10.414700 10.239027  9.948180
  • finding the median value for every column in matr:
apply(matr, MARGIN = 2, FUN = median, simplify = TRUE)
[1] 4.960671 5.078630 5.033694 5.362717 4.899435
  • finding the minimum value for every column of matu (recall, should be (close to) 0)
apply(matu, MARGIN = 2, FUN = min, simplify = TRUE)
       var_1        var_2        var_3        var_4        var_5 
3.130340e-04 3.611716e-05 3.060959e-03 3.400710e-04 1.325362e-03 
  • or the location of the minimum value in every column:
apply(matu, MARGIN = 2, FUN = which.min, simplify = TRUE)
var_1 var_2 var_3 var_4 var_5 
  926    54   956   283   579 
  • finding the maximum value for every column of matu (recall, should be (close to) 1):
apply(matu, MARGIN = 2, FUN = max, simplify = TRUE)
    var_1     var_2     var_3     var_4     var_5 
0.9981666 0.9997119 0.9987890 0.9996086 0.9992079 
  • or the location of the maximum value in every column:
apply(matu, MARGIN = 2, FUN = which.max, simplify = TRUE)
var_1 var_2 var_3 var_4 var_5 
   22    46   291   743   947 
  • cumulative sum for the first 10 rows of every column of matu
apply(matu[1:10, ], MARGIN = 2, FUN = cumsum, simplify = TRUE)
          var_1     var_2     var_3     var_4    var_5
obs_1  0.600226 0.0169425 0.1018651 0.6426288 0.158503
obs_2  1.265842 0.1729380 0.3665798 1.5988249 0.946418
obs_3  2.119343 1.1211914 0.4628187 1.9303313 1.093771
obs_4  2.847117 1.1250811 1.1040666 2.2613488 2.054247
obs_5  3.028297 2.1034462 1.1882292 2.4386040 2.696421
obs_6  3.540854 2.2688245 1.7402507 3.3594061 2.948909
obs_7  3.782852 2.7938194 1.7781946 4.3171831 3.733870
obs_8  4.271986 3.2511397 2.4019932 4.7598717 3.736695
obs_9  4.524177 3.9606292 2.7464903 4.8748816 3.979814
obs_10 4.985903 4.1790758 3.7290677 5.4866638 4.220611
  • cumulative product for the first 10 rows of every column of matu
apply(matu[1:10, ], MARGIN = 2, FUN = cumprod, simplify = TRUE)
              var_1        var_2        var_3        var_4        var_5
obs_1  0.6002259518 1.694250e-02 1.018651e-01 0.6426288278 1.585030e-01
obs_2  0.3995199498 2.642955e-03 2.696519e-02 0.6144791858 1.248869e-01
obs_3  0.3409906601 2.506191e-03 2.595101e-03 0.2037037471 1.840246e-02
obs_4  0.2481642950 9.748173e-06 1.664103e-03 0.0674295104 1.767512e-02
obs_5  0.0449622375 9.537273e-06 1.400551e-04 0.0119522310 1.135051e-02
obs_6  0.0230457362 1.577258e-06 7.731343e-05 0.0110056391 2.865859e-03
obs_7  0.0055770128 8.280525e-07 2.933578e-06 0.0105409480 2.249587e-03
obs_8  0.0027279104 3.786852e-07 1.829962e-06 0.0046663574 6.355042e-06
obs_9  0.0006879535 2.686732e-07 6.304165e-07 0.0005366777 1.545037e-06
obs_10 0.0003176458 5.869076e-08 6.194330e-07 0.0003283298 3.720400e-07
  • draw a sample from every column with size 100 and without replacement:
apply(matu, MARGIN = 2, FUN = sample, size = 100, replace = FALSE, simplify = TRUE)
             var_1        var_2       var_3      var_4       var_5
  [1,] 0.427874653 0.2962461817 0.245244351 0.71371006 0.614320555
  [2,] 0.619837658 0.9732767020 0.005746100 0.69770648 0.104410692
  [3,] 0.211551783 0.7025180887 0.634824334 0.41890317 0.751804105
  [4,] 0.937301772 0.8625655503 0.313025048 0.25229257 0.727244976
  [5,] 0.468054898 0.1333571775 0.412062724 0.14209116 0.205491589
  [6,] 0.652220312 0.1343263080 0.982577400 0.88488528 0.545954252
  [7,] 0.094731264 0.4035322617 0.792527967 0.20194269 0.846915320
  [8,] 0.963287048 0.6726470231 0.396185544 0.03091596 0.379442458
  [9,] 0.829897380 0.1997247315 0.415923977 0.20976465 0.203501021
 [10,] 0.395167737 0.0038896375 0.377768022 0.08304928 0.522725631
 [11,] 0.663212207 0.8908440617 0.932076235 0.68903437 0.069070730
 [12,] 0.010062481 0.5519522047 0.340680641 0.45924904 0.499460750
 [13,] 0.163756774 0.4289824814 0.263133530 0.51283482 0.205762443
 [14,] 0.312948626 0.6088482901 0.699145186 0.18986819 0.818035963
 [15,] 0.569327792 0.4906537151 0.495169261 0.37241806 0.360954494
 [16,] 0.688230235 0.3483616519 0.923396978 0.67985158 0.527013391
 [17,] 0.304477348 0.7016855879 0.738518027 0.22485392 0.699525842
 [18,] 0.914051189 0.0028812722 0.949652195 0.21667207 0.184232822
 [19,] 0.015076805 0.7671682036 0.071696028 0.84160925 0.434728433
 [20,] 0.680477128 0.7291652344 0.789702306 0.04536275 0.132914925
 [21,] 0.483864088 0.4406194480 0.028702055 0.01458980 0.045473437
 [22,] 0.090609562 0.3710276980 0.862361548 0.40796349 0.126826719
 [23,] 0.535610609 0.8934164054 0.939425986 0.86611405 0.031051245
 [24,] 0.406181956 0.3546411851 0.343321695 0.74156995 0.401845718
 [25,] 0.650372807 0.2254920613 0.007305878 0.96089933 0.294542913
 [26,] 0.523949534 0.0001744141 0.447297217 0.40851562 0.764654150
 [27,] 0.744043353 0.7912811625 0.052715830 0.96053204 0.409537329
 [28,] 0.976969536 0.1195863134 0.982976033 0.74644039 0.110714359
 [29,] 0.133260578 0.2870601334 0.115480395 0.13382330 0.122441713
 [30,] 0.214659605 0.9601216391 0.788732514 0.11746025 0.800846495
 [31,] 0.606783367 0.6674754317 0.860529469 0.04416643 0.575592778
 [32,] 0.701639672 0.8425107629 0.600073412 0.05594724 0.788355635
 [33,] 0.395247573 0.4100571568 0.167690868 0.70724876 0.945276574
 [34,] 0.997906438 0.2519476619 0.735430416 0.48066398 0.747088682
 [35,] 0.196408861 0.4243587430 0.596180666 0.51482828 0.957349570
 [36,] 0.727074796 0.9684854858 0.652410179 0.96595233 0.722289495
 [37,] 0.986973475 0.0895374368 0.845779282 0.41230179 0.293967669
 [38,] 0.773164391 0.9638337884 0.552748314 0.06159046 0.458007812
 [39,] 0.796141559 0.7765745323 0.846832885 0.20770468 0.298801046
 [40,] 0.624487828 0.3564885638 0.660334249 0.17683940 0.057621229
 [41,] 0.817350582 0.5913988259 0.220299744 0.86317435 0.299523436
 [42,] 0.967824784 0.4424968567 0.921304168 0.26098331 0.489219855
 [43,] 0.914439068 0.3942986336 0.361325487 0.60169566 0.616653974
 [44,] 0.768421780 0.4217687396 0.295858546 0.37798105 0.077182890
 [45,] 0.016625161 0.6658115482 0.165320277 0.77455860 0.158503026
 [46,] 0.440597997 0.9446910012 0.243012169 0.06447767 0.420961288
 [47,] 0.719860912 0.1772992392 0.560751369 0.44902747 0.849164016
 [48,] 0.839452247 0.8715715578 0.214128093 0.23395167 0.642936618
 [49,] 0.833640202 0.8292887262 0.674816251 0.66711147 0.277001954
 [50,] 0.338264248 0.3612342407 0.238634258 0.94576723 0.542285567
 [51,] 0.542008152 0.0192937264 0.760847150 0.01364760 0.358393266
 [52,] 0.392433255 0.1112235764 0.775153869 0.87165202 0.237003826
 [53,] 0.164499223 0.4162324592 0.275682220 0.19663832 0.074004117
 [54,] 0.369945183 0.4217562771 0.262902491 0.78702023 0.042206783
 [55,] 0.240512830 0.2073696717 0.976909903 0.05358947 0.167880482
 [56,] 0.414119643 0.8492104961 0.605697764 0.41609591 0.872418524
 [57,] 0.662489830 0.1843502908 0.716483586 0.26179877 0.819733626
 [58,] 0.758360146 0.3945711800 0.928629791 0.01249312 0.650508635
 [59,] 0.599016845 0.3895975223 0.367571383 0.91714007 0.002824981
 [60,] 0.336946693 0.1554408779 0.450853944 0.76029389 0.491443093
 [61,] 0.716182998 0.9103479120 0.725621383 0.01429185 0.686092847
 [62,] 0.061481926 0.0163614631 0.989081329 0.52692042 0.761924360
 [63,] 0.286311698 0.6886869357 0.764402661 0.19420963 0.645052908
 [64,] 0.743579587 0.6328592277 0.015155191 0.06716482 0.375276764
 [65,] 0.338893080 0.8474844724 0.048234599 0.92080208 0.868130674
 [66,] 0.955673186 0.1031871077 0.428569158 0.28122884 0.376567396
 [67,] 0.624607087 0.4536155986 0.790632120 0.38309418 0.860546991
 [68,] 0.575755496 0.4721420540 0.053433392 0.47605170 0.634406123
 [69,] 0.755386295 0.0770887041 0.696568140 0.17478258 0.801871024
 [70,] 0.254544970 0.6029461438 0.617348765 0.71324939 0.334514807
 [71,] 0.492098489 0.3417994273 0.806618107 0.64777243 0.158112442
 [72,] 0.981012912 0.7096717325 0.890726835 0.97584609 0.060581106
 [73,] 0.277054097 0.8711021238 0.514938284 0.22167477 0.422439821
 [74,] 0.211788388 0.8441557020 0.972471799 0.47317091 0.444995942
 [75,] 0.573928014 0.3699196551 0.576670796 0.15495689 0.749400497
 [76,] 0.895156737 0.3571989350 0.218822891 0.06512148 0.393396356
 [77,] 0.881618620 0.0430566731 0.635972946 0.93043863 0.766673522
 [78,] 0.682188959 0.7933296857 0.625962837 0.03882977 0.034771726
 [79,] 0.814984214 0.0663515609 0.524825637 0.84765934 0.265665035
 [80,] 0.877733724 0.4871458802 0.025809797 0.82977315 0.754634449
 [81,] 0.637299148 0.5170553513 0.833798896 0.46606657 0.207820970
 [82,] 0.801481901 0.2931252166 0.136746070 0.24331348 0.826765755
 [83,] 0.871730286 0.9466384773 0.064129959 0.64118513 0.762469654
 [84,] 0.159506244 0.6049182811 0.029450050 0.56676077 0.697369893
 [85,] 0.731072485 0.0996799695 0.164878438 0.58723558 0.221791439
 [86,] 0.094668780 0.7252484844 0.710831779 0.93143246 0.950283931
 [87,] 0.469556776 0.4755556541 0.319298754 0.60529539 0.649953627
 [88,] 0.508050772 0.0509314602 0.015070460 0.17068511 0.869120294
 [89,] 0.086548502 0.5792983784 0.353958581 0.61178214 0.111286720
 [90,] 0.997671658 0.5400455620 0.197374821 0.05923918 0.284796121
 [91,] 0.107591556 0.3836067128 0.052368591 0.14104294 0.417281280
 [92,] 0.358777107 0.7570024268 0.086379471 0.13378116 0.746380477
 [93,] 0.490966906 0.0319006587 0.891686819 0.50759548 0.242711229
 [94,] 0.459255106 0.5008176900 0.482638942 0.87273702 0.338060063
 [95,] 0.794867382 0.7031364362 0.456189326 0.09008759 0.870708651
 [96,] 0.007536766 0.5311022992 0.284598397 0.92246731 0.065699746
 [97,] 0.983863101 0.5435822248 0.992995270 0.76408524 0.590230260
 [98,] 0.483696888 0.5903762605 0.520435375 0.91285026 0.627955907
 [99,] 0.202509591 0.3787623248 0.370204954 0.84725329 0.556368871
[100,] 0.839834335 0.4521930083 0.804261910 0.18736976 0.368933453

Note that in this case, the ... argument of apply() is used to pass the sample size and the replacement method to the function: size = 100, replace = FALSE.

  • sort each column in matr in decreasing order (here the first 10 rows):
apply(matr[1:10, ], 2, sort, decreasing = TRUE)
           [,1]        [,2]        [,3]        [,4]        [,5]
 [1,] 15.460149  16.2546301 15.91465424  21.2399511 20.27354208
 [2,] 15.020452  13.5497243 12.07519698  18.7872667  6.64172071
 [3,]  8.922464  10.4209857  6.55576309  17.0451525  6.26640391
 [4,]  6.488385   9.3100825  5.60532134  10.5860546  4.51004705
 [5,]  4.711979   4.5994621  0.09526772   2.4596832  2.70702889
 [6,]  4.606092   2.4900037 -0.91912989   1.2480615  1.37693694
 [7,] -1.793389   1.0875137 -0.99093526   0.8535719 -0.01652879
 [8,] -3.080375  -0.3370674 -1.47375072  -2.9420104 -0.63829143
 [9,] -6.356346 -10.4651753 -3.40353956  -3.0203727 -2.86726235
[10,] -6.714175 -16.3735425 -8.44512193 -22.1994413 -4.65856598

Note that in this case, the ... argument of apply() is used to pass the sort order to sort(): decreasing = TRUE.
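How apply() forwards extra arguments through ... can be sketched with a small matrix. The matrix m below is a hypothetical example, not the matr used above. Any named argument that apply() does not use itself is handed on to FUN.

```r
# Hypothetical 5 x 4 example matrix (an assumption, not the matr from the text)
m <- matrix(1:20, nrow = 5, ncol = 4)

# decreasing = TRUE is not an argument of apply(); it is passed on to sort()
sorted_desc <- apply(m, 2, sort, decreasing = TRUE)
sorted_desc

# The same mechanism works for other functions: probs is passed on to quantile()
q <- apply(m, 2, quantile, probs = c(0.25, 0.75))
q
```

Naming the forwarded arguments (decreasing = ..., probs = ...) keeps it clear which function receives them.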

You are not limited to these predefined functions. In the FUN argument, you can define your own function. As we will see in Chapter 14, R allows you to build your own functions. In addition, you can use so-called anonymous functions or lambda functions in the FUN argument of apply(). To do so, you first write function(x) or use the shorthand \(x) and then add the body of your function, e.g. mean(x)/sd(x). With apply(), x refers to a column if MARGIN = 2 and to a row if MARGIN = 1. Note that there is no comma between function(x) and the body of your function. Putting all this into apply() gives apply(mat, MARGIN, function(x) mean(x)/sd(x)), with MARGIN equal to 1 or 2. This statement could be read as: for each row (if MARGIN = 1) or each column (if MARGIN = 2), substitute that row or column for x in the function. In other words, and assuming that MARGIN = 2, R applies the function to mat[, 1], then to mat[, 2], … until it reaches the last column. Each time, R stores the outcome in a vector, matrix or list and adds, where possible, the name of that column to that vector, matrix or list. In the previous examples, FUN = mean was actually shorthand for function(x) mean(x). As mean is a known function in R, you don’t need to use function(x). As this function is the third argument after mat and MARGIN, you can further shorten the apply() code to apply(mat, 2, mean).

For instance, to standardize all columns in a matrix, you can define function(x) (x - mean(x))/sd(x) or \(x) (x - mean(x))/sd(x). These functions are anonymous because they don’t have a name. Other functions, such as mean() or functions that you will write yourself, have a name. This allows you to use these functions throughout your code. Anonymous functions or lambda functions only exist where they are used: you cannot call them in subsequent parts of your code.
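The difference between a named and an anonymous function can be sketched as follows. The matrix m and the name standardize are hypothetical and only serve as an illustration. Note that the \(x) shorthand assumes R 4.1.0 or later.

```r
# Hypothetical 4 x 3 example matrix (an assumption, not the matr from the text)
m <- matrix(c(2, 4, 6, 8,
              1, 3, 5, 7,
              10, 20, 30, 40), nrow = 4, ncol = 3)

# Named function: defined once, can be reused anywhere later in the script
standardize <- function(x) (x - mean(x)) / sd(x)
res_named <- apply(m, 2, standardize)

# Anonymous function: exists only inside this call to apply()
res_anon <- apply(m, 2, function(x) (x - mean(x)) / sd(x))

# Same anonymous function written with the \(x) shorthand
res_short <- apply(m, 2, \(x) (x - mean(x)) / sd(x))

# All three approaches give the same standardized matrix
all.equal(res_named, res_anon)
all.equal(res_named, res_short)

# After standardizing, every column has standard deviation 1
apply(res_named, 2, sd)
```

A named function is the better choice as soon as you need the same calculation more than once.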

To illustrate, let’s standardize the first 10 rows of all columns in matr after multiplying them by 10 and adding 5. Although we could include this restriction in the apply() function, we will first create a matrix with the first 10 rows:

matr10 <- 5 + 10 * matr[1:10, ]
matr10
           [,1]        [,2]       [,3]       [,4]       [,5]
 [1,] -12.93389    1.629326 125.751970 -216.99441 -23.672623
 [2,] -58.56346   15.875137 164.146542   13.53572   4.834712
 [3,]  94.22464  -99.651753   5.952677  175.45152 -41.585660
 [4,] -25.80375  109.209857 -79.451219  217.39951 207.735421
 [5,] 159.60149  140.497243  70.557631  -25.20373  -1.382914
 [6,] -62.14175   50.994621  -9.737507   29.59683  32.070289
 [7,]  52.11979  167.546301 -29.035396   17.48061  50.100470
 [8,]  51.06092 -158.735425  61.053213  110.86055  67.664039
 [9,] 155.20452   98.100825  -4.191299  192.87267  71.417207
[10,]  69.88385   29.900037  -4.909353  -24.42010  18.769369

Using matr10, we can now write the apply() command to standardize:

apply(matr10, 2, function(x) (x - mean(x))/sd(x))
            [,1]        [,2]       [,3]       [,4]        [,5]
 [1,] -0.6822876 -0.32897513  1.2872629 -2.0395166 -0.88910479
 [2,] -1.2462908 -0.19075950  1.8035029 -0.2723077 -0.48205544
 [3,]  0.6422431 -1.31162383 -0.3235164  0.9689141 -1.14488071
 [4,] -0.8413651  0.71479209 -1.4718273  1.2904810  2.41511475
 [5,]  1.4503320  1.01834838  0.5451391 -0.5692784 -0.57083542
 [6,] -1.2905202  0.14997656 -0.5344812 -0.1491857 -0.09316522
 [7,]  0.1218070  1.28078360 -0.7939538 -0.2420668  0.16428339
 [8,]  0.1087188 -1.88486509  0.4173461  0.4737695  0.41506934
 [9,]  1.3959834  0.60701010 -0.4599088  1.1024619  0.46865992
[10,]  0.3413793 -0.05468719 -0.4695635 -0.5632713 -0.28308583

or

apply(matr10, 2, \(x) (x - mean(x))/sd(x))
            [,1]        [,2]       [,3]       [,4]        [,5]
 [1,] -0.6822876 -0.32897513  1.2872629 -2.0395166 -0.88910479
 [2,] -1.2462908 -0.19075950  1.8035029 -0.2723077 -0.48205544
 [3,]  0.6422431 -1.31162383 -0.3235164  0.9689141 -1.14488071
 [4,] -0.8413651  0.71479209 -1.4718273  1.2904810  2.41511475
 [5,]  1.4503320  1.01834838  0.5451391 -0.5692784 -0.57083542
 [6,] -1.2905202  0.14997656 -0.5344812 -0.1491857 -0.09316522
 [7,]  0.1218070  1.28078360 -0.7939538 -0.2420668  0.16428339
 [8,]  0.1087188 -1.88486509  0.4173461  0.4737695  0.41506934
 [9,]  1.3959834  0.60701010 -0.4599088  1.1024619  0.46865992
[10,]  0.3413793 -0.05468719 -0.4695635 -0.5632713 -0.28308583

Here, we used “\(x)” as shorthand for “function(x)”.

Let’s now use the apply() function to

  • transform the values in matr10 using the min-max transformation (x - min(x))/(max(x) - min(x)). This function rescales the values in every column to a 0-1 range:
apply(matr10, 2, \(x) (x - min(x))/(max(x) - min(x)))
            [,1]      [,2]      [,3]      [,4]       [,5]
 [1,] 0.22191368 0.4914917 0.8423854 0.0000000 0.07184726
 [2,] 0.01613708 0.5351527 1.0000000 0.5306937 0.18618711
 [3,] 0.70516872 0.1810818 0.3505939 0.9034333 0.00000000
 [4,] 0.16387424 0.8212084 0.0000000 1.0000000 1.00000000
 [5,] 1.00000000 0.9170991 0.6158055 0.4415133 0.16124888
 [6,] 0.00000000 0.6427882 0.2861837 0.5676673 0.29542608
 [7,] 0.51528761 1.0000000 0.2069634 0.5397751 0.36774319
 [8,] 0.51051238 0.0000000 0.5767887 0.7547411 0.43818878
 [9,] 0.98017090 0.7871610 0.3089516 0.9435378 0.45324233
[10,] 0.59539855 0.5781368 0.3060039 0.4433172 0.24207752
  • determine the number of positive values in every column:
apply(matr10, 2, function(x) sum(x > 0))
[1] 6 8 5 7 7
  • add to every value in every column the absolute value of the difference between 0 and the minimum value (i.e. make every value at least equal to 0 or larger than 0)
apply(matr10, 2, \(x) x + abs(min(x) - 0))
            [,1]      [,2]      [,3]     [,4]      [,5]
 [1,]  49.207857 160.36475 205.20319   0.0000  17.91304
 [2,]   3.578288 174.61056 243.59776 230.5301  46.42037
 [3,] 156.366395  59.08367  85.40390 392.4459   0.00000
 [4,]  36.338003 267.94528   0.00000 434.3939 249.32108
 [5,] 221.743237 299.23267 150.00885 191.7907  40.20275
 [6,]   0.000000 209.73005  69.71371 246.5912  73.65595
 [7,] 114.261542 326.28173  50.41582 234.4750  91.68613
 [8,] 113.202667   0.00000 140.50443 327.8550 109.24970
 [9,] 217.346269 256.83625  75.25992 409.8671 113.00287
[10,] 132.025602 188.63546  74.54187 192.5743  60.35503
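  • calculate, for each column in matr10, the probability of finding a value larger than the difference between the column mean and 5, using a t-distribution with 9 degrees of freedom. Note that this difference is not divided by its standard error, so the outcome is not the p-value of a t-test: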
apply(matr10, 2, function(x) pt(mean(x)-5, df = 9, lower.tail = FALSE))
[1] 1.789074e-11 1.060196e-10 6.264105e-10 3.993985e-12 4.521195e-11
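For comparison, a one-sample t-test standardizes the difference between the mean and the hypothesized value by its standard error. The sketch below uses simulated data as a stand-in for matr10: the seed, the matrix m and the hypothesized value 5 are assumptions made for illustration only.

```r
# Simulated stand-in for matr10 (an assumption): 10 rows, 5 columns
set.seed(1)
m <- matrix(rnorm(50, mean = 5, sd = 10), nrow = 10, ncol = 5)

# Upper-tail probability based on the standardized difference (the t-statistic)
p_upper <- apply(m, 2, function(x) {
  n <- length(x)
  t_stat <- (mean(x) - 5) / (sd(x) / sqrt(n))   # difference / standard error
  pt(t_stat, df = n - 1, lower.tail = FALSE)
})
p_upper

# The built-in one-sided t-test returns the same p-values
p_ttest <- apply(m, 2, function(x) t.test(x, mu = 5, alternative = "greater")$p.value)
all.equal(p_upper, p_ttest)
```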

4.2.7 Matrix algebra

Using *, R calculates the product of two matrices element-wise. You can also calculate the matrix product of two matrices. In addition, R allows you to calculate e.g. the determinant of a (square) matrix. To illustrate, we’ll use one square matrix A

A <- matrix(c(147, 258, 369, 123, 456, 789, 159, 483, 267), 3, 3)
A
     [,1] [,2] [,3]
[1,]  147  123  159
[2,]  258  456  483
[3,]  369  789  267

and a column vector x with 3 rows:

x <- matrix(c(5, 10, 15), 3, 1)
x
     [,1]
[1,]    5
[2,]   10
[3,]   15

Using these two matrices, you can now do matrix algebra. For instance:

  • transpose of a matrix (change rows into columns): you can use t(). For the square matrix A, the element in position (i, j) moves to position (j, i).
t(A)
     [,1] [,2] [,3]
[1,]  147  258  369
[2,]  123  456  789
[3,]  159  483  267

As you can see, 258, which is in position (2, 1) in A is now located in position (1, 2). Applying the transpose to x changes this vector from a column vector into a row vector:

t(x)
     [,1] [,2] [,3]
[1,]    5   10   15
  • matrix multiplication. Recall that matrix multiplication requires that the number of columns in the first matrix equals the number of rows in the second and that the outcome is a matrix with dimension (nrow(first), ncol(second)). Here, A is a 3x3 matrix and x is a 3x1. Using %*% you can multiply both. The outcome is a 3x1 matrix.
A %*% x
      [,1]
[1,]  4350
[2,] 13095
[3,] 13740

The element in position (1, 1) is equal to 147 * 5 + 123 * 10 + 159 * 15, the element in position (2, 1) is equal to 258 * 5 + 456 * 10 + 483 * 15 and the element in (3, 1) is equal to 369 * 5 + 789 * 10 + 267 * 15.
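These row-by-column sums can be reproduced by hand. The sketch below rebuilds A and x so that it runs on its own. The name by_hand is just an illustrative choice.

```r
A <- matrix(c(147, 258, 369, 123, 456, 789, 159, 483, 267), 3, 3)
x <- matrix(c(5, 10, 15), 3, 1)

# Element i of A %*% x: multiply row i of A element-wise with x and sum
by_hand <- sapply(1:3, function(i) sum(A[i, ] * x[, 1]))
by_hand

# The manual calculation agrees with the matrix product
all.equal(by_hand, as.vector(A %*% x))
```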

  • determinant of a matrix. The determinant is only defined for square matrices. Here, A is a square matrix, so we can use det(A) to calculate the determinant:
det(A)
[1] -19060920

Here, det(A) = 147 * 456 * 267 + 123 * 483 * 369 + 258 * 789 * 159 - 159 * 456 * 369 - 123 * 258 * 267 - 147 * 789 * 483. Here, the determinant is different from zero. In other words, the columns in A are linearly independent.

  • trace of a matrix. The trace is only defined for square matrices. It equals the sum of the elements on the diagonal of the matrix:
sum(diag(A))
[1] 870
  • inverse of a matrix. To determine the inverse of a matrix, you can use the solve(a, b) function. In general, this function solves a system of equations ax = b. If you do not supply b, it is set equal to the identity matrix. In that case, solve() calculates the inverse:
solve(A)
             [,1]         [,2]          [,3]
[1,]  0.013605587 -0.004858632  0.0006870078
[2,] -0.005736397  0.001018943  0.0015727992
[3,] -0.001851852  0.003703704 -0.0018518519
  • eigenvalues of a matrix
eigen(A)$values
[1] 1079.22223 -273.74185   64.51962
  • eigenvectors of a matrix
eigen(A)$vectors
           [,1]       [,2]       [,3]
[1,] -0.2102804 -0.1744814 -0.9081183
[2,] -0.6517779 -0.4998281  0.3790710
[3,] -0.7286753  0.8483679  0.1778379
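The eigenvalues and eigenvectors above can be checked against their definition: for every pair, A multiplied by the eigenvector equals the eigenvalue times the eigenvector. The sketch below also checks two well-known properties: the eigenvalues sum to the trace and multiply to the determinant. The names e, v1, lhs and rhs are illustrative.

```r
A <- matrix(c(147, 258, 369, 123, 456, 789, 159, 483, 267), 3, 3)
e <- eigen(A)

# Definition: A v = lambda v, checked here for the first eigenpair
v1  <- e$vectors[, 1]
lhs <- as.vector(A %*% v1)
rhs <- e$values[1] * v1
all.equal(lhs, rhs)

# Sum of the eigenvalues equals the trace
all.equal(sum(e$values), sum(diag(A)))

# Product of the eigenvalues equals the determinant
all.equal(prod(e$values), det(A))
```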

From your mathematics class, you may recall that the solution of a system of equations

\[ Ax = B \]

equals

\[ x = A^{-1}B. \]

Let’s first define B:

B <- A %*% x

In R, you can find the solution as solve(A, B):

solve(A, B)
     [,1]
[1,]    5
[2,]   10
[3,]   15

Recall that we used this function to calculate the inverse. There, this function sets B equal to the identity matrix I, in other words

\[ x = A^{-1}B = A^{-1}I = A^{-1} \]
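The two uses of solve() can be compared directly. The sketch below rebuilds A, x and B so that it runs on its own. The names x_via_inverse and x_direct are illustrative.

```r
A <- matrix(c(147, 258, 369, 123, 456, 789, 159, 483, 267), 3, 3)
x <- matrix(c(5, 10, 15), 3, 1)
B <- A %*% x

# Route 1: compute the inverse first, then multiply it with B
x_via_inverse <- solve(A) %*% B

# Route 2: let solve() handle the system in one step
x_direct <- solve(A, B)

# Both routes recover the original x (up to rounding error)
all.equal(x_via_inverse, x_direct)
all.equal(x_direct, x)

# The inverse multiplied with A gives the identity matrix
all.equal(solve(A) %*% A, diag(3))
```

In practice, solve(A, B) is usually preferred over computing the inverse explicitly, as it needs fewer calculations and is less sensitive to rounding error.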

4.2.8 Other matrix functions

There are many other packages available that you can install and use to do matrix calculations. For instance, {matrixStats} is a package that includes many functions that apply to rows and columns of a matrix. To use the package, you need to install it first. To do so, use install.packages("matrixStats"). The functions in this package are faster and more memory efficient than using apply(). You can find all the functions in that package in Bengtsson (2025).
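As a small illustration, the column standard deviations from apply() can be compared with colSds() from {matrixStats}. The matrix m is a hypothetical example, and the sketch only calls the package if it is installed.

```r
# Hypothetical 4 x 3 example matrix (an assumption for illustration)
m <- matrix(c(2, 4, 6, 8, 1, 3, 5, 7, 10, 20, 30, 40), nrow = 4, ncol = 3)

# Base R: standard deviation of every column with apply()
sd_apply <- apply(m, 2, sd)
sd_apply

# Same calculation with {matrixStats}, only if the package is available
if (requireNamespace("matrixStats", quietly = TRUE)) {
  sd_ms <- matrixStats::colSds(m)
  print(all.equal(sd_apply, sd_ms))

  # Row-wise functions exist as well, e.g. the median of every row
  print(matrixStats::rowMedians(m))
}
```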

Using matA and matB,

matA <- matrix(1:16, 4, 4)
matB <- matrix(101:116, 4, 4)
  • add matA to matB:
matA + matB
     [,1] [,2] [,3] [,4]
[1,]  102  110  118  126
[2,]  104  112  120  128
[3,]  106  114  122  130
[4,]  108  116  124  132
  • multiply matA with matB element-wise:
matA * matB
     [,1] [,2] [,3] [,4]
[1,]  101  525  981 1469
[2,]  204  636 1100 1596
[3,]  309  749 1221 1725
[4,]  416  864 1344 1856
  • take the natural logarithm of the 2nd column of matB:
log(matB[, 2])
[1] 4.653960 4.663439 4.672829 4.682131
  • determine the probability that you will have a value less than or equal to those in the 1st column of matA if these values follow a normal distribution with mean 2 and standard deviation 1.5. What do you expect for the value matA[2, 1] = 2?
pnorm(matA[, 1], 2, 1.5)
[1] 0.2524925 0.5000000 0.7475075 0.9087888
  • determine the means and the sum of every column in matB:
colMeans(matB)
[1] 102.5 106.5 110.5 114.5
colSums(matB)
[1] 410 426 442 458
  • determine the mean and the sum of every row in matA
rowMeans(matA)
[1]  7  8  9 10
rowSums(matA)
[1] 28 32 36 40
  • standardize the values of matB:
scale(matB, center = TRUE, scale = TRUE)
           [,1]       [,2]       [,3]       [,4]
[1,] -1.1618950 -1.1618950 -1.1618950 -1.1618950
[2,] -0.3872983 -0.3872983 -0.3872983 -0.3872983
[3,]  0.3872983  0.3872983  0.3872983  0.3872983
[4,]  1.1618950  1.1618950  1.1618950  1.1618950
attr(,"scaled:center")
[1] 102.5 106.5 110.5 114.5
attr(,"scaled:scale")
[1] 1.290994 1.290994 1.290994 1.290994
  • subtract the mean of column i from the values in that column of matA:
scale(matA, center = TRUE, scale = FALSE)
     [,1] [,2] [,3] [,4]
[1,] -1.5 -1.5 -1.5 -1.5
[2,] -0.5 -0.5 -0.5 -0.5
[3,]  0.5  0.5  0.5  0.5
[4,]  1.5  1.5  1.5  1.5
attr(,"scaled:center")
[1]  2.5  6.5 10.5 14.5

Using the apply() function and simplifying your results:

  • determine the quantiles for every column of matB:
apply(matB, MARGIN = 2, FUN = quantile, simplify = TRUE)
       [,1]   [,2]   [,3]   [,4]
0%   101.00 105.00 109.00 113.00
25%  101.75 105.75 109.75 113.75
50%  102.50 106.50 110.50 114.50
75%  103.25 107.25 111.25 115.25
100% 104.00 108.00 112.00 116.00
  • determine the quantiles for every row of matB:
apply(matB, MARGIN = 1, FUN = quantile, simplify = TRUE)
     [,1] [,2] [,3] [,4]
0%    101  102  103  104
25%   104  105  106  107
50%   107  108  109  110
75%   110  111  112  113
100%  113  114  115  116
  • find the location of the minimum for every column of matA (what do you expect?)
apply(matA, MARGIN = 2, FUN = which.min, simplify = TRUE)
[1] 1 1 1 1
  • rescale the values in every row of matA by subtracting the mean of that row and dividing by the difference between the maximum and minimum of that row (note that apply() returns the result for each row as a column):
apply(matA, 1, function(x) (x - mean(x))/(max(x) - min(x)), simplify = TRUE)
           [,1]       [,2]       [,3]       [,4]
[1,] -0.5000000 -0.5000000 -0.5000000 -0.5000000
[2,] -0.1666667 -0.1666667 -0.1666667 -0.1666667
[3,]  0.1666667  0.1666667  0.1666667  0.1666667
[4,]  0.5000000  0.5000000  0.5000000  0.5000000
  • subtract, from every value in every column in matB, the median value of that column and divide by the standard deviation of that column:
apply(matB, 2, \(x) (x - median(x))/sd(x), simplify = TRUE)
           [,1]       [,2]       [,3]       [,4]
[1,] -1.1618950 -1.1618950 -1.1618950 -1.1618950
[2,] -0.3872983 -0.3872983 -0.3872983 -0.3872983
[3,]  0.3872983  0.3872983  0.3872983  0.3872983
[4,]  1.1618950  1.1618950  1.1618950  1.1618950
  • determine for every column of matB if its mean is different from 101 at the 5% level using Student’s t-distribution with 3 degrees of freedom. Show the results as TRUE if the mean is different and FALSE otherwise. Do so in one line of code within the apply() function.
apply(matB, 2, function(x) (pt(mean(x)-101, df = 3, lower.tail = FALSE)) <= 0.05)
[1] FALSE  TRUE  TRUE  TRUE

Using matC

vec1 <- sample(c(letters, LETTERS), 16)
vec2 <- sample(c(letters, LETTERS), 16)
matC <- matrix(paste(vec1, vec2, sep = "_"), 8, 2)
matC
     [,1]  [,2] 
[1,] "V_O" "d_i"
[2,] "W_u" "M_R"
[3,] "R_c" "t_L"
[4,] "q_W" "s_M"
[5,] "S_F" "i_P"
[6,] "k_r" "Y_Q"
[7,] "c_t" "f_K"
[8,] "e_G" "H_s"
  • determine for each column how many times the pattern “uppercase_lowercase” (e.g. A_b) occurs:
apply(matC, 2, function(x) sum(grepl(pattern = "[A-Z]_[a-z]", x)), simplify = TRUE)
[1] 2 1
  • determine for each row how many times the patterns “lowercase_lowercase” or “lowercase_uppercase” (e.g. n_k or n_K) occur:
apply(matC, 1, function(x) sum(grepl(pattern = "[a-z]_[a-zA-Z]", x)), simplify = TRUE)
[1] 1 0 1 2 1 1 2 1

Using matD and matE

matD <- matrix(rnorm(5, 5, 10), 5, 1)
matE <- matrix(rnorm(5, 5, 10), 5, 1)
  • calculate the transpose of matD:
t(matD)
         [,1]       [,2]    [,3]      [,4]      [,5]
[1,] 13.75133 -0.1315757 1.75805 -8.878751 -1.445147
  • calculate the matrix product of the transpose of matD and matE
t(matD) %*% matE
          [,1]
[1,] -205.5606
  • subtract the mean from matD, multiply the transpose of this matrix with the matrix itself, divide by the number of rows - 1 and take the square root:
matDscale <- scale(matD, center = TRUE, scale = FALSE)
sqrt((t(matDscale) %*% matDscale)/(nrow(matD) - 1))
         [,1]
[1,] 8.185649
  • calculate the standard deviation of matD:
sd(matD)
[1] 8.185649

What do you see if you compare the outcome of the last two calculations? Do you know why that is the case?

4.3 Arrays

Vectors are uni-dimensional. Matrices are two-dimensional. Arrays allow you to store data in more than two dimensions. You can think of arrays as a series of matrices of the same dimensions. Like matrices, they are homogeneous: all values in an array have the same type. In other words, matrices are a special case of arrays: they are arrays with 1 matrix.

4.3.1 Creating an array

To create an array, you can use the array() function. This function needs the data to be stored in the array and the dimensions of the array, stored in a vector. Here, you need three: nrow, ncol and nmat. The array() function reads the data and determines the dimensions from c(nrow, ncol, nmat). For instance, to create an array with 2 3x3 matrices, you can use

arr <- array(1:18, c(3, 3, 2))
arr
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18

The array arr includes 2 matrices. Both are 3x3 matrices. The first includes the values 1 to 9 and the second the values 10 to 18. Note that R fills the array matrix by matrix and, within each matrix, column by column.
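This fill order means that every position in the data corresponds to exactly one combination of row, column and matrix. A minimal sketch, using a copy called arr_demo (a name chosen here for illustration) and the base R function arrayInd():

```r
arr_demo <- array(1:18, c(3, 3, 2))

# Which (row, column, matrix) holds the 14th value of the data?
pos <- arrayInd(14, dim(arr_demo))
pos

# The same position written as a formula: i + (j - 1) * nrow + (k - 1) * nrow * ncol
i <- 2; j <- 2; k <- 2
i + (j - 1) * 3 + (k - 1) * 3 * 3

# Subsetting with these indices returns the 14th value
arr_demo[2, 2, 2]
```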

To see if arr is an array, you check its class

class(arr)
[1] "array"

In addition, to see the type of values of an array, you can use

typeof(arr)
[1] "integer"

In this case, R read the values as integers. As an alternative, you can verify if arr is an array using:

is.array(arr)
[1] TRUE

You can check the dimensions of an array using

dim(arr)
[1] 3 3 2

Let’s see what happens if the data in the array() function has fewer values than the number of elements in the array:

arr <- array(1:6, c(3, 3, 2))
arr
, , 1

     [,1] [,2] [,3]
[1,]    1    4    1
[2,]    2    5    2
[3,]    3    6    3

, , 2

     [,1] [,2] [,3]
[1,]    4    1    4
[2,]    5    2    5
[3,]    6    3    6

As was the case with matrices, R uses some values in the data more than once. Here, the first matrix is filled by column. As the data only includes the values 1 to 6, R uses the first three values of the data, 1 to 3, again to fill the last column of the first matrix. R then continues with the values 4 to 6 to fill the first column of the second matrix. To fill the last 2 columns of the second matrix, R uses the data a third time.
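This recycling can be made explicit with rep_len(), which repeats a vector until it reaches a given length. The sketch below, with illustrative names, shows that array() behaves as if the data had first been extended to 18 values.

```r
# The 6 values, repeated until there are 3 * 3 * 2 = 18 of them
recycled <- rep_len(1:6, 18)
recycled

# Building the array from the short data or from the extended data gives the same result
arr_short <- array(1:6, c(3, 3, 2))
arr_long  <- array(recycled, c(3, 3, 2))
all.equal(arr_short, arr_long)
```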

If the data include more values than there are elements in the array, e.g.

arr <- array(1:24, c(3, 3, 2))
arr
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18

R uses the values it needs to fill the array and leaves all others out. Here, there are 24 values to store in an 18-element array. R uses only the first 18.

Suppose you have two matrices, matc1 and matc2

matc1 <- matrix(1:9, 3, 3)
matc2 <- matrix(11:19, 3, 3)

You can collect these into an array using cbind(). This function creates a new matrix by adding the columns of matc2 to those of matc1. When the data is a matrix, R fills the array with the elements in all rows of the first column, then moves to all rows of the second column, and so on.

cbind(matc1, matc2)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    4    7   11   14   17
[2,]    2    5    8   12   15   18
[3,]    3    6    9   13   16   19

In other words, to fill an array from 2 3x3 matrices, you can use:

arrc <- array(cbind(matc1, matc2), c(3, 3, 2))
arrc
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   11   14   17
[2,]   12   15   18
[3,]   13   16   19

You can add names to the rows, columns and matrices in an array. You can do so in a number of ways. First, you can include the names in a list in the array() function. This list first shows the row names, then the column names, followed by the matrix names:

arr <- array(1:18, c(3, 3, 2), 
             dimnames = list(c("year_1", "year_2", "year_3"), 
                             c("var_1", "var_2", "var_3"), 
                             c("firm_1", "firm_2")))
arr
, , firm_1

       var_1 var_2 var_3
year_1     1     4     7
year_2     2     5     8
year_3     3     6     9

, , firm_2

       var_1 var_2 var_3
year_1    10    13    16
year_2    11    14    17
year_3    12    15    18

To verify the names of an array, you can use dimnames(arr):

dimnames(arr)
[[1]]
[1] "year_1" "year_2" "year_3"

[[2]]
[1] "var_1" "var_2" "var_3"

[[3]]
[1] "firm_1" "firm_2"

Here, R returns a list. To extract the names for the rows, columns or matrices, you add [1], [2] or [3]. Doing so, you extract a list with their names. For instance, for the rows:

dimnames(arr)[1]
[[1]]
[1] "year_1" "year_2" "year_3"

Second, you can add the names to an existing array. To do so for arrc, you can use

dimnames(arrc) <- list(c("year_1", "year_2", "year_3"), 
                             c("var_1", "var_2", "var_3"), 
                             c("firm_1", "firm_2"))

You can verify these names in a similar way

dimnames(arrc)
[[1]]
[1] "year_1" "year_2" "year_3"

[[2]]
[1] "var_1" "var_2" "var_3"

[[3]]
[1] "firm_1" "firm_2"

The dimensions of an array and the dimnames are attributes of an array. To see this, we can ask R to show the attributes of arrc:

attributes(arrc)
$dim
[1] 3 3 2

$dimnames
$dimnames[[1]]
[1] "year_1" "year_2" "year_3"

$dimnames[[2]]
[1] "var_1" "var_2" "var_3"

$dimnames[[3]]
[1] "firm_1" "firm_2"

R returns a list. To access these values, you can use, e.g.

attributes(arrc)$dim
[1] 3 3 2
attributes(arrc)$dimnames[1]
[[1]]
[1] "year_1" "year_2" "year_3"
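Because the dimensions are just an attribute, you can turn a plain vector into an array by assigning to dim(), and back into a vector by removing that attribute. A minimal sketch with an illustrative vector v:

```r
v <- 1:18
is.array(v)            # a plain vector has no dim attribute

# Adding the dim attribute turns the vector into a 3 x 3 x 2 array
dim(v) <- c(3, 3, 2)
is.array(v)
dim(v)

# The result holds the same values as an array built with array()
all.equal(v, array(1:18, c(3, 3, 2)))

# Removing the attribute gives back a plain vector
dim(v) <- NULL
is.array(v)
```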

4.3.2 Subsetting an array

To subset an array, you can use an approach which is very similar to the approach for matrices and vectors: subsetting by position, by name or by logical condition.

4.3.2.1 Subsetting by position

To illustrate, we will use the following array:

vec1 <- c(111, 211, 311, 121, 221, 321, 131, 231, 331, 112, 212, 312, 122, 222, 322, 132, 232, 332)
rown <- paste("year", 1:3, sep = "_")
coln <- paste("var", 1:3, sep = "_")
matn <- paste("firm", 1:2, sep = "_")
arr <- array(vec1, c(3, 3, 2), dimnames = list(rown, coln, matn))
arr
, , firm_1

       var_1 var_2 var_3
year_1   111   121   131
year_2   211   221   231
year_3   311   321   331

, , firm_2

       var_1 var_2 var_3
year_1   112   122   132
year_2   212   222   232
year_3   312   322   332

The digits of each value are equal to its row number, column number and matrix number. To subset an array using index positions, you include them in [i, j, k]. The first index position refers to the row, the second to the column and the third to the matrix. For instance, to extract the value in the third row of the second column in the first matrix:

arr[3, 2, 1]
[1] 321

Note that R simplifies the result. In other words, [] acts as a simplifying subsetting operator. To preserve the structure, you need to add drop = FALSE. Doing so, R will keep the structure of the data:

arr[3, 2, 1, drop = FALSE]
, , firm_1

       var_2
year_3   321

To subset one value from both matrices, you can use [i, j, ]. Here, you leave the third dimension (the matrix) open. R will show the results in a simplified way unless you add drop = FALSE. For instance, the element on the first row and first column of both matrices equals:

arr[1, 1, ]
firm_1 firm_2 
   111    112 

As you can see, R simplifies the result to a vector. Adding drop = FALSE preserves the structure of the data:

arr[1, 1, , drop = FALSE]
, , firm_1

       var_1
year_1   111

, , firm_2

       var_1
year_1   112
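The effect of drop = FALSE is easiest to see in the dimensions of the result. The sketch below rebuilds the array without names under the illustrative name arr_demo and compares dim() with and without drop = FALSE.

```r
vals <- c(111, 211, 311, 121, 221, 321, 131, 231, 331,
          112, 212, 312, 122, 222, 322, 132, 232, 332)
arr_demo <- array(vals, c(3, 3, 2))

# Without drop = FALSE, dimensions of length 1 are removed
dim(arr_demo[1, 1, ])                 # NULL: the result is a plain vector
dim(arr_demo[, , 2])                  # 3 3: the result is a matrix

# With drop = FALSE, the result stays a three-dimensional array
dim(arr_demo[1, 1, , drop = FALSE])   # 1 1 2
dim(arr_demo[, , 2, drop = FALSE])    # 3 3 1
```

Keeping the structure is useful when later code expects an array, for instance code that subsets with three indices.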

You can extract the values in all columns of one row i in one matrix k using [i, , k]. For instance, to subset all values in the first row of the first matrix:

arr[1, , 1]
var_1 var_2 var_3 
  111   121   131 

[, j, k] subsets the values on all rows in column j of matrix k. For instance, to see the values for the second column of the second matrix:

arr[, 2, 2]
year_1 year_2 year_3 
   122    222    322 

If you leave two positions open, you extract

  • one matrix (e.g. the second matrix)
arr[, , 2, drop = FALSE]
, , firm_2

       var_1 var_2 var_3
year_1   112   122   132
year_2   212   222   232
year_3   312   322   332
  • one column in all matrices (e.g. the second column)
arr[, 2, , drop = FALSE]
, , firm_1

       var_2
year_1   121
year_2   221
year_3   321

, , firm_2

       var_2
year_1   122
year_2   222
year_3   322
  • one row in all matrices (e.g. the third row)
arr[3, , ]
      firm_1 firm_2
var_1    311    312
var_2    321    322
var_3    331    332

There are two ways to subset multiple rows, columns or matrices from an array. The first uses the colon and subsets a range from x to y: x:y. For instance, rows 1 to 2 from column 1 and matrices 1 to 2:

arr[1:2, 1, 1:2]
       firm_1 firm_2
year_1    111    112
year_2    211    212

Collecting all rows, columns or matrices you want to subset in a vector using c() allows you to subset these rows, columns and matrices individually. For instance, subsetting rows 1 and 3 and columns 1 and 3 from matrices 1 and 2:

arr[c(1, 3), c(1, 3), c(1, 2)]
, , firm_1

       var_1 var_3
year_1   111   131
year_3   311   331

, , firm_2

       var_1 var_3
year_1   112   132
year_3   312   332

Using negative indices, you can subset all rows/columns/matrices except those with a negative index number. In the previous example, we extracted all values except those in row 2 and column 2 from all matrices. You can do the same with negative index positions:

arr[-2, -2, ]
, , firm_1

       var_1 var_3
year_1   111   131
year_3   311   331

, , firm_2

       var_1 var_3
year_1   112   132
year_3   312   332
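That both approaches select the same elements can be verified with identical(). The sketch rebuilds the array without names under the illustrative name arr_demo.

```r
vals <- c(111, 211, 311, 121, 221, 321, 131, 231, 331,
          112, 212, 312, 122, 222, 322, 132, 232, 332)
arr_demo <- array(vals, c(3, 3, 2))

# Positive indices: keep rows 1 and 3 and columns 1 and 3
keep <- arr_demo[c(1, 3), c(1, 3), c(1, 2)]

# Negative indices: drop row 2 and column 2
drop_2 <- arr_demo[-2, -2, ]

# Both selections are the same 2 x 2 x 2 array
identical(keep, drop_2)
dim(drop_2)
```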

4.3.2.2 Subsetting by name

You can also use the names of the rows, columns and matrices to subset. To do so, you include the names in quotation marks within the subsetting operator. For instance:

  • subset one element:
arr["year_1", "var_1", "firm_1"]
[1] 111
  • subset one column in one matrix
arr[, "var_1", "firm_1"]
year_1 year_2 year_3 
   111    211    311 
  • subset one row in one matrix
arr["year_3", , "firm_2"]
var_1 var_2 var_3 
  312   322   332 
  • subset one row and one column in all matrices
arr["year_3", "var_2", ]
firm_1 firm_2 
   321    322 
  • subset one row for all columns and matrices
arr["year_3", , ]
      firm_1 firm_2
var_1    311    312
var_2    321    322
var_3    331    332
  • subset one column for all rows and matrices
arr[, "var_3", ]
       firm_1 firm_2
year_1    131    132
year_2    231    232
year_3    331    332
  • subset one matrix
arr[, , "firm_2"]
       var_1 var_2 var_3
year_1   112   122   132
year_2   212   222   232
year_3   312   322   332

4.3.2.3 Subsetting by logical condition

As you could with matrices, you can subset an array with a logical array. Let’s first create a random logical array:

cond <- array(sample(c(TRUE, FALSE), size = 18, replace = TRUE), c(3, 3, 2))
cond
, , 1

      [,1]  [,2]  [,3]
[1,]  TRUE FALSE FALSE
[2,] FALSE  TRUE  TRUE
[3,] FALSE FALSE FALSE

, , 2

      [,1]  [,2] [,3]
[1,] FALSE  TRUE TRUE
[2,] FALSE FALSE TRUE
[3,] FALSE FALSE TRUE
arr[cond]
[1] 111 221 231 122 132 232 332

You can create these logical conditions in many ways. For instance, if you want to extract all values larger than 200, you can use this condition in the subsetting operator:

arr[arr > 200]
 [1] 211 311 221 321 231 331 212 312 222 322 232 332

You can refine this condition. For instance, if you want to extract all values for the rows and matrices where the first column of the first matrix is larger than 200, you can define the following condition:

cond <- arr[, 1, 1] > 200
cond
year_1 year_2 year_3 
 FALSE   TRUE   TRUE 

As you can see, there are two values in the first column of the first matrix that are larger than 200. These values are in rows 2 and 3. You can now use this condition to extract the values for rows 2 and 3 for all columns and in both matrices:

arr[cond, , ]
, , firm_1

       var_1 var_2 var_3
year_2   211   221   231
year_3   311   321   331

, , firm_2

       var_1 var_2 var_3
year_2   212   222   232
year_3   312   322   332
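A logical condition tells you which values are selected, but not where they are located. As a complement, the base R function which() with arr.ind = TRUE returns the row, column and matrix index of every value that meets the condition. The sketch uses an unnamed copy called arr_demo (an illustrative name).

```r
vals <- c(111, 211, 311, 121, 221, 321, 131, 231, 331,
          112, 212, 312, 122, 222, 322, 132, 232, 332)
arr_demo <- array(vals, c(3, 3, 2))

# One line per matching value: row, column and matrix index
pos <- which(arr_demo > 300, arr.ind = TRUE)
pos

# A matrix of index positions can itself be used to subset the array
arr_demo[pos]
```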

Recall that for a matrix, you could use grepl() to subset row or column names. With arrays, you can also subset matrix names. For instance, to extract the full matrix for every matrix whose name ends with a 2 or 3, you can use the pattern “_[2-3]”. To do so, you need the matrix names. You can extract these names using

dimnames(arr)
[[1]]
[1] "year_1" "year_2" "year_3"

[[2]]
[1] "var_1" "var_2" "var_3"

[[3]]
[1] "firm_1" "firm_2"

The output of this function is a list. To extract the values of the third element in this list, you can use the double subsetting operator [[ ]]: dimnames(arr)[[3]]. We’ll cover that operator more in depth when we discuss lists. Now you have all the information you need to extract the values:

arr[, , grepl(pattern = "_[2-3]", x = dimnames(arr)[[3]])]
       var_1 var_2 var_3
year_1   112   122   132
year_2   212   222   232
year_3   312   322   332

4.3.3 Changing an array

4.3.3.1 Changing elements of an array

As you could with matrices, you can change an individual value or a range of values by subsetting that value or range and assigning a different value. For instance, to multiply all values in the second column of the first matrix by 10:

arr[, 2, 1] <- arr[, 2, 1] * 10
arr[, , 1]
       var_1 var_2 var_3
year_1   111  1210   131
year_2   211  2210   231
year_3   311  3210   331
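
Reassignment also works with a logical condition. As a hedged illustration (arr_demo and arr_capped are example names, and the cap of 300 is an arbitrary choice), the following sketch replaces every value above 300 by 300 in a copy of the array:

```r
# rebuild the example array: value = 100 * year + 10 * variable + firm
arr_demo <- array(
  c(111, 211, 311, 121, 221, 321, 131, 231, 331,
    112, 212, 312, 122, 222, 322, 132, 232, 332),
  dim = c(3, 3, 2),
  dimnames = list(paste0("year_", 1:3), paste0("var_", 1:3), paste0("firm_", 1:2))
)

# work on a copy so the original stays available
arr_capped <- arr_demo

# the condition selects the elements, the assignment overwrites them
arr_capped[arr_capped > 300] <- 300
arr_capped["year_3", , ]
```

The dimensions and names of the array are not affected: only the selected values change.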

4.3.4 Changing an array’s dimensions

4.3.4.1 Adding matrices to an array

To add a matrix to an array, you can use the abind() function of the {abind} package. This package is not part of base R: if it is not installed yet, install it first with install.packages("abind"). Let’s first define a new array, arr1. We know that we will add it to arr. In other words, we can use the names of the rows and columns in arr to create the names for the rows and columns in the new array arr1. To do so, we use dimnames(arr)[[1]] for the row names and dimnames(arr)[[2]] for the column names. To be consistent with the naming of the matrices, I’ll use “firm_3” for the matrix name. Using array():

arr1 <- array(c(113, 213, 313, 123, 223, 323, 133, 233, 333),
              dim = c(3, 3, 1),
              dimnames = list(dimnames(arr)[[1]], dimnames(arr)[[2]], "firm_3"))
arr1
, , firm_3

       var_1 var_2 var_3
year_1   113   123   133
year_2   213   223   233
year_3   313   323   333

The abind() function has many options. Here, we will keep all default values and add the matrix as the last matrix in the array. To do so, use:

abind::abind(arr, arr1)
, , firm_1

       var_1 var_2 var_3
year_1   111  1210   131
year_2   211  2210   231
year_3   311  3210   331

, , firm_2

       var_1 var_2 var_3
year_1   112   122   132
year_2   212   222   232
year_3   312   322   332

, , firm_3

       var_1 var_2 var_3
year_1   113   123   133
year_2   213   223   233
year_3   313   323   333

As you can see, the array has 3 matrices: firm_1, firm_2 and firm_3. Here, I used an array, but you can also add a matrix.

A second way starts from the deconstruction of the array. Recall that c() applied to a matrix turns the matrix into a vector. The same holds for an array. After deconstruction, you can append the values of your new matrix to that vector. Doing so, you have all the elements that you need to rebuild an array. For instance,

# c() appends the nine new values to the flattened array
arr_new <- array(c(c(arr), c(113, 213, 313, 123, 223, 323, 133, 233, 333)),
                 dim = c(3, 3, 3),
                 dimnames = list(dimnames(arr)[[1]], dimnames(arr)[[2]], c(dimnames(arr)[[3]], "firm_3")))
arr_new
, , firm_1

       var_1 var_2 var_3
year_1   111  1210   131
year_2   211  2210   231
year_3   311  3210   331

, , firm_2

       var_1 var_2 var_3
year_1   112   122   132
year_2   212   222   232
year_3   312   322   332

, , firm_3

       var_1 var_2 var_3
year_1   113   123   133
year_2   213   223   233
year_3   313   323   333

To add rows or columns to the matrices, you can first extract each matrix into a separate object:

mat_1 <- arr[, , 1]
mat_2 <- arr[, , 2]

Using cbind() or rbind() you can now add new rows or columns. For instance, let’s add c(411, 4210, 431) to the first matrix and c(412, 422, 432) to the second:

mat_1 <- rbind(mat_1, c(411, 4210, 431))
mat_2 <- rbind(mat_2, c(412, 422, 432))

You can now change the array arr:

arr <- array(cbind(mat_1, mat_2),
             dim = c(4, 3, 2),
             dimnames = list(c(dimnames(arr)[[1]], "year_4"), dimnames(arr)[[2]], dimnames(arr)[[3]]))
arr
, , firm_1

       var_1 var_2 var_3
year_1   111  1210   131
year_2   211  2210   231
year_3   311  3210   331
year_4   411  4210   431

, , firm_2

       var_1 var_2 var_3
year_1   112   122   132
year_2   212   222   232
year_3   312   322   332
year_4   412   422   432

The fourth row is now added to both matrices.

4.3.4.2 Removing elements from an array

Removing parts of an array can be done using negative indices. However, in that case, you need to make sure that the dimensions of the various matrices stay equal. For instance, to remove the fourth row from all matrices in arr:

arr[-4, , ]
, , firm_1

       var_1 var_2 var_3
year_1   111  1210   131
year_2   211  2210   231
year_3   311  3210   331

, , firm_2

       var_1 var_2 var_3
year_1   112   122   132
year_2   212   222   232
year_3   312   322   332

To remove a matrix (e.g. the third) from arr_new:

arr_new[, , -3]
, , firm_1

       var_1 var_2 var_3
year_1   111  1210   131
year_2   211  2210   231
year_3   311  3210   331

, , firm_2

       var_1 var_2 var_3
year_1   112   122   132
year_2   212   222   232
year_3   312   322   332

4.3.5 Applying functions to an array

As most functions are vectorized, most will apply to each element of an array. For instance

  • natural logarithm:
log(arr)
, , firm_1

          var_1    var_2    var_3
year_1 4.709530 7.098376 4.875197
year_2 5.351858 7.700748 5.442418
year_3 5.739793 8.074026 5.802118
year_4 6.018593 8.345218 6.066108

, , firm_2

          var_1    var_2    var_3
year_1 4.718499 4.804021 4.882802
year_2 5.356586 5.402677 5.446737
year_3 5.743003 5.774552 5.805135
year_4 6.021023 6.045005 6.068426
  • power (e.g. 2)
arr^2
, , firm_1

        var_1    var_2  var_3
year_1  12321  1464100  17161
year_2  44521  4884100  53361
year_3  96721 10304100 109561
year_4 168921 17724100 185761

, , firm_2

        var_1  var_2  var_3
year_1  12544  14884  17424
year_2  44944  49284  53824
year_3  97344 103684 110224
year_4 169744 178084 186624
  • square root
sqrt(arr)
, , firm_1

          var_1    var_2    var_3
year_1 10.53565 34.78505 11.44552
year_2 14.52584 47.01064 15.19868
year_3 17.63519 56.65686 18.19341
year_4 20.27313 64.88451 20.76054

, , firm_2

          var_1    var_2    var_3
year_1 10.58301 11.04536 11.48913
year_2 14.56022 14.89966 15.23155
year_3 17.66352 17.94436 18.22087
year_4 20.29778 20.54264 20.78461
  • exponential:
exp(arr)
, , firm_1

               var_1 var_2         var_3
year_1  1.609487e+48   Inf  7.808671e+56
year_2  4.326490e+91   Inf 2.099062e+100
year_3 1.163011e+135   Inf 5.642525e+143
year_4 3.126310e+178   Inf 1.516777e+187

, , firm_2

               var_1         var_2         var_3
year_1  4.375039e+48  9.636666e+52  2.122617e+57
year_2  1.176062e+92  2.590449e+96 5.705843e+100
year_3 3.161392e+135 6.963429e+139 1.533797e+144
year_4 8.498192e+178 1.871851e+183 4.123027e+187

After subsetting the appropriate matrix, you can apply these functions to one or more matrices. If you reassign their values, these matrices will also change in the array:

arr[, , 1] <- log(arr[, , 1])
arr
, , firm_1

          var_1    var_2    var_3
year_1 4.709530 7.098376 4.875197
year_2 5.351858 7.700748 5.442418
year_3 5.739793 8.074026 5.802118
year_4 6.018593 8.345218 6.066108

, , firm_2

       var_1 var_2 var_3
year_1   112   122   132
year_2   212   222   232
year_3   312   322   332
year_4   412   422   432

You can calculate the column means and column sums (or their row equivalents) using colMeans() and colSums(). When we introduced these functions for matrices, we disregarded the dims argument. Here this argument plays a role. dims = 1 shows the means per column and per matrix:

colMeans(arr_new, dims = 1)
      firm_1 firm_2 firm_3
var_1    211    212    213
var_2   2210    222    223
var_3    231    232    233

Changing this into dims = 2 calculates the mean of all values per matrix:

colMeans(arr_new, dims = 2)
firm_1 firm_2 firm_3 
   884    222    223 

Whether you need the first or the second option depends on the data in the matrices. Here, if the matrices refer to firms, the variables to e.g. revenue, profit or market capitalization and the rows to years, an average across all variables per firm doesn’t make sense. However, if your data refer to measurements (e.g. temperature) per hour and location, where each matrix is a day, an average across all measurements per day does make sense: it is the average daily temperature in e.g. a country.

colSums, rowSums and rowMeans work in a similar way.
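
For the row functions, the dims argument counts from the other side: the mean or sum is taken over all dimensions after the first dims ones. A small sketch, assuming a rebuilt copy of the unmodified example array (arr_demo, by_year and by_year_var are illustrative names), shows the difference:

```r
# rebuild the example array: value = 100 * year + 10 * variable + firm
arr_demo <- array(
  c(111, 211, 311, 121, 221, 321, 131, 231, 331,
    112, 212, 312, 122, 222, 322, 132, 232, 332),
  dim = c(3, 3, 2),
  dimnames = list(paste0("year_", 1:3), paste0("var_", 1:3), paste0("firm_", 1:2))
)

# dims = 1: one mean per row (year), across all columns and matrices
by_year <- rowMeans(arr_demo, dims = 1)
by_year

# dims = 2: one mean per row and column, across the matrices (firms)
by_year_var <- rowMeans(arr_demo, dims = 2)
by_year_var
```

With these values, the first element of by_year is 121.5 (the mean of the six year_1 values) and the top-left element of by_year_var is 111.5 (the mean of 111 and 112).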

The apply() function with MARGIN = 2 applies a function FUN to all columns of an array. For instance, the average of each column across all matrices in arr_new can be calculated as

apply(arr_new, 2, mean)
var_1 var_2 var_3 
  212   885   232 

To use the apply() function per matrix, you can write a for loop. We will discuss loops in more depth in Chapter 13, but the overall setup of a loop is straightforward. The first part is for (i in c(1, 2, 3)). Here, i first takes the first value in c(1, 2, 3), i.e. i will be 1. The second part of the loop includes the statement that R needs to execute, for instance k <- i^2: R calculates the square of i and assigns it to k. When R finishes this code, it moves back to i in c(1, 2, 3) and changes the value of i from 1 to 2. It then executes the code with i = 2, and so on until all values have been used.

Here, we use the fact that we can determine the number of matrices from dim(arr_new): the third position in that vector shows the number of matrices. This determines how many iterations the for loop will make. The code R needs to execute is the apply() function. All we need to do is store the results in a separate matrix. With respect to the dimensions: the apply() function generates a mean for every variable and for every matrix. If we store the means per matrix in a separate row, we need as many columns in the matrix as there are columns in the array and as many rows as there are matrices in the array. We are now in a position to write the loop. First we create the matrix for the results:

nc <- dim(arr_new)[2]
nr <- dim(arr_new)[3]
matrix_mean <- matrix(0, nr, nc)

# add column names and row names
# column names are the names in the array
# row names are the names of the matrices in the array

colnames(matrix_mean) <- dimnames(arr_new)[[2]]
rownames(matrix_mean) <- dimnames(arr_new)[[3]]

We can use this matrix to store the results as we apply the apply() function across all matrices in the array:

for (i in 1:dim(arr_new)[3]) {
  matrix_mean[i, ] <- apply(arr_new[, , i], 2, mean)
}

To see the results for the mean per variable and per matrix, you can check:

matrix_mean
       var_1 var_2 var_3
firm_1   211  2210   231
firm_2   212   222   232
firm_3   213   223   233
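
As a side note, apply() also accepts a vector of margins, which is one possible way to get the same table without a loop. The sketch below is only an illustration under stated assumptions: arr_demo is a hypothetical 3x3x3 array built for this example (value = 100 * year + 10 * variable + firm), not the arr_new object from the text.

```r
# build a self-contained 3x3x3 example array
idx <- expand.grid(year = 1:3, var = 1:3, firm = 1:3)
arr_demo <- array(
  100 * idx$year + 10 * idx$var + idx$firm,
  dim = c(3, 3, 3),
  dimnames = list(paste0("year_", 1:3), paste0("var_", 1:3), paste0("firm_", 1:3))
)

# loop version: one row of column means per matrix
loop_res <- matrix(0, dim(arr_demo)[3], dim(arr_demo)[2],
                   dimnames = list(dimnames(arr_demo)[[3]], dimnames(arr_demo)[[2]]))
for (i in seq_len(dim(arr_demo)[3])) {
  loop_res[i, ] <- apply(arr_demo[, , i], 2, mean)
}

# vector of margins: keep dimension 3 (matrices) and dimension 2 (columns)
apply_res <- apply(arr_demo, c(3, 2), mean)
apply_res
```

Both objects hold the mean per variable and per matrix, with the matrices in the rows and the variables in the columns.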

In a similar way, you can use the apply() function for all other functions, including your own.

First create a 4x3x2 array (24 values) arr1 filled with c(1:24)

Code
arr1 <- array(1:24, c(4, 3, 2))
arr1
, , 1

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

, , 2

     [,1] [,2] [,3]
[1,]   13   17   21
[2,]   14   18   22
[3,]   15   19   23
[4,]   16   20   24

Using two 4x3 matrices, mat1 and mat2, the first filled with c(1:12) and the second with c(13:24), create an array arr2 that contains these two matrices.

Code
mat1 <- matrix(1:12, 4, 3)
mat2 <- matrix(13:24, 4, 3)

arr2 <- array(cbind(mat1, mat2), c(4, 3, 2))
arr2
, , 1

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

, , 2

     [,1] [,2] [,3]
[1,]   13   17   21
[2,]   14   18   22
[3,]   15   19   23
[4,]   16   20   24

Set names for the rows (obs_1, obs_2, …), the columns (var_1, var_2, …) and the matrices (mat_1, mat_2) of arr1.

Code
dimnames(arr1) <- list(c("obs_1", "obs_2", "obs_3", "obs_4"), 
                        c("var_1", "var_2", "var_3"), 
                        c("mat_1", "mat_2"))

Check the attributes of arr1.

Code
attributes(arr1)
$dim
[1] 4 3 2

$dimnames
$dimnames[[1]]
[1] "obs_1" "obs_2" "obs_3" "obs_4"

$dimnames[[2]]
[1] "var_1" "var_2" "var_3"

$dimnames[[3]]
[1] "mat_1" "mat_2"

Using arr2, extract

  • the value on the second row of the second column in the second matrix:
Code
arr2[2, 2, 2]
[1] 18
  • all values in the first row of the first matrix:
Code
arr2[1, , 1]
[1] 1 5 9
  • all values in the third column of the second matrix:
Code
arr2[, 3, 2]
[1] 21 22 23 24
  • all values in the first row and the second column of both matrices:
Code
arr2[1, 2, ]
[1]  5 17
  • all values in the first and second column of the first matrix:
Code
arr2[, 1:2, 1]
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8
  • all values but those in the first row of both matrices:
Code
arr2[-1, , ]
, , 1

     [,1] [,2] [,3]
[1,]    2    6   10
[2,]    3    7   11
[3,]    4    8   12

, , 2

     [,1] [,2] [,3]
[1,]   14   18   22
[2,]   15   19   23
[3,]   16   20   24

Using names, extract from arr1

  • the first row and second column of both matrices
Code
arr1["obs_1", "var_2", ]
mat_1 mat_2 
    5    17 
  • the values in the second matrix
Code
arr1[, , "mat_2"]
      var_1 var_2 var_3
obs_1    13    17    21
obs_2    14    18    22
obs_3    15    19    23
obs_4    16    20    24

Extract all values larger than 15 from arr2

Code
arr2[arr2 > 15]
[1] 16 17 18 19 20 21 22 23 24

Create a 4x3 matrix, mat_3, filled with c(25:36)

Code
mat_3 <- matrix(25:36, 4, 3)

Add this matrix to arr2

Code
arr2 <- abind::abind(arr2, mat_3)

Remove the fourth row of each matrix in arr2.

Code
arr2 <- arr2[-4, , ]
arr2
, , 1

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11

, , 2

     [,1] [,2] [,3]
[1,]   13   17   21
[2,]   14   18   22
[3,]   15   19   23

, , 3

     [,1] [,2] [,3]
[1,]   25   29   33
[2,]   26   30   34
[3,]   27   31   35

Remove the third matrix from arr2

Code
arr2[, , -3]
, , 1

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11

, , 2

     [,1] [,2] [,3]
[1,]   13   17   21
[2,]   14   18   22
[3,]   15   19   23

Calculate the column means for every column in each matrix of arr1.

Code
colMeans(arr1, dims = 1)
      mat_1 mat_2
var_1   2.5  14.5
var_2   6.5  18.5
var_3  10.5  22.5

Calculate the column sum for every column in each matrix of arr1.

Code
colSums(arr1, dims = 1)
      mat_1 mat_2
var_1    10    58
var_2    26    74
var_3    42    90

Use the apply() function to calculate for each column in arr1 the value (x - min(x))/(max(x) - min(x)). Write your code in such a way that you can apply it to other arrays with different dimensions. You have to write a for loop. This statement includes for (i in ...) {apply(...)}. Store the results in an array res.

Code
res <- array(0, dim(arr1))
for (i in 1:dim(arr1)[3]) {
  res[, , i] <- apply(arr1[, , i], 2, function(x) (x - min(x))/(max(x) - min(x)))
}
res
, , 1

          [,1]      [,2]      [,3]
[1,] 0.0000000 0.0000000 0.0000000
[2,] 0.3333333 0.3333333 0.3333333
[3,] 0.6666667 0.6666667 0.6666667
[4,] 1.0000000 1.0000000 1.0000000

, , 2

          [,1]      [,2]      [,3]
[1,] 0.0000000 0.0000000 0.0000000
[2,] 0.3333333 0.3333333 0.3333333
[3,] 0.6666667 0.6666667 0.6666667
[4,] 1.0000000 1.0000000 1.0000000

4.4 Lists

Lists are widely used in R. In the previous sections we referred to lists a couple of times. For instance, str_extract_all() returns a list by default. Likewise, the apply() function returns a list when its results cannot be simplified to a vector, matrix or array, or when you set simplify = FALSE. The attributes of a matrix are shown in a list. Here we add more depth. With lists we move from homogeneous data structures to heterogeneous data structures. Heterogeneous data structures can be used to store various types of data.

4.4.1 What are lists?

Like vectors, lists are one-dimensional. Unlike vectors, matrices or arrays, they can be used to store various data types. In a list, you can store vectors, matrices, character values, formulas, plots, arrays or other lists. In other words, every element in a list can have a different type as well as different dimensions. As a result, lists are a very flexible way of storing a wide variety of data in one data structure. They are used, e.g., to store hierarchical data, to organize complete datasets and to store the output of formulas or functions. For instance, the attributes() function for arrays returns a complex data structure that includes the dimensions of an array as well as the names of the rows, columns and matrices. The dimensions are numeric, the names are character values. The dimensions include 3 elements (the number of rows, columns and matrices), while each set of names can have any length, starting from one.
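
The contrast with a vector can be made concrete with a small sketch (mixed_vec and mixed_list are names chosen for this illustration). Combining values of different types with c() triggers implicit coercion, while list() keeps each element as it is:

```r
# a vector is homogeneous: all values are coerced to character
mixed_vec <- c(1, "a", TRUE)
typeof(mixed_vec)
mixed_vec

# a list is heterogeneous: every element keeps its own type
mixed_list <- list(1, "a", TRUE)
elem_types <- sapply(mixed_list, typeof)
elem_types
```

The vector ends up with the single type "character", whereas the three list elements are still of type "double", "character" and "logical".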

4.4.2 Non-nested list

A non-nested list is a list that doesn’t include any other lists. In other words, the elements of this list are e.g. matrices, vectors or character values. Suppose you have the following data per student: the name, the student number, the program in which the student is enrolled, a logical indicator for exchange students and information on the courses in the student’s individual program, including their name, ECTS credits and lecture hours. These data are stored in various data structures:

student <- "Alice Wonderland"
studentnr <- "r00369258"
program <- "Bachelor business adminstration"
exchange <- FALSE
course <- c("Data and programming skills", "Strategic management", "Macro-economics and economic policy", "Economic sociology", "Introduction to methods for operational research")
ects <- c(6, 3, 6, 3, 3)
hours <- c(52, 26, 52, 26, 26)

From the previous sections, you should recognize these structures as three character values, a logical value, a character vector and two numeric vectors.

To create a list, you can use the list() function. This function’s main arguments are the objects to store in the list. These objects could be named, but for now, we’ll add no names. We can add all these structures to a list using:

stud1 <- list(student, studentnr, program, exchange, course, ects, hours)

Let’s first inspect the structure of this list using str():

str(stud1)
List of 7
 $ : chr "Alice Wonderland"
 $ : chr "r00369258"
 $ : chr "Bachelor business adminstration"
 $ : logi FALSE
 $ : chr [1:5] "Data and programming skills" "Strategic management" "Macro-economics and economic policy" "Economic sociology" ...
 $ : num [1:5] 6 3 6 3 3
 $ : num [1:5] 52 26 52 26 26

Here, you can see that this list has 7 elements: 4 with type character, 1 with type logical and 2 with type numeric. As you can see, lists can store elements with various types. We can also inspect the list by printing it:

stud1
[[1]]
[1] "Alice Wonderland"

[[2]]
[1] "r00369258"

[[3]]
[1] "Bachelor business adminstration"

[[4]]
[1] FALSE

[[5]]
[1] "Data and programming skills"                     
[2] "Strategic management"                            
[3] "Macro-economics and economic policy"             
[4] "Economic sociology"                              
[5] "Introduction to methods for operational research"

[[6]]
[1] 6 3 6 3 3

[[7]]
[1] 52 26 52 26 26

Here, you see that stud1 has two levels: the first is the level of the 7 elements in that list, the second level contains the individual values of each of these 7 elements. The highest level is shown with double square brackets [[ ]]. The second level is shown with single square brackets [ ]. You can verify the class and type of stud1 using

class(stud1)
[1] "list"
typeof(stud1)
[1] "list"

As you can see, both show “list”. You can determine the number of components in a list using the length() function. Here, stud1 has 7 components. To check this, you can use

length(stud1)
[1] 7
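
Note that length() counts the components of the list, not the values inside them. To see how many values each component holds, base R offers lengths(). A short sketch with a reduced, hypothetical version of the student list (stud_demo):

```r
# a small unnamed list: one name and two numeric vectors
stud_demo <- list("Alice Wonderland", c(6, 3, 6, 3, 3), c(52, 26, 52, 26, 26))

# number of components in the list
n_components <- length(stud_demo)
n_components  # 3

# number of values in each component
n_values <- lengths(stud_demo)
n_values      # 1 5 5
```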

Many of the functions from the previous sections that return a list return an unnamed list. For instance, str_extract_all() returns

char <- c("Fair if foul and foul is fair.",  "Hover through the fog and filthy air.")
stringr::str_extract_all(char, pattern = "fair|fog|filthy")
[[1]]
[1] "fair"

[[2]]
[1] "fog"    "filthy"

They do this because the results of these functions are often not compatible with a matrix or vector. For instance, here, you have two sets of matches: one with 1 element (fair) and one with 2 elements (fog and filthy both appear in the second element of the character vector). To store these results, you need a list.

You can name the elements of a list by adding the names in the list() function. For instance:

stud1 <- list(name = student, 
                 number = studentnr,
                 program = program,
                 exchange = exchange,
                 course  = course,
                 hours = hours, 
                 ects = ects)

If you check the structure of the list, you can now see the names of that list:

str(stud1)
List of 7
 $ name    : chr "Alice Wonderland"
 $ number  : chr "r00369258"
 $ program : chr "Bachelor business adminstration"
 $ exchange: logi FALSE
 $ course  : chr [1:5] "Data and programming skills" "Strategic management" "Macro-economics and economic policy" "Economic sociology" ...
 $ hours   : num [1:5] 52 26 52 26 26
 $ ects    : num [1:5] 6 3 6 3 3

Printing the list also reveals the names:

stud1
$name
[1] "Alice Wonderland"

$number
[1] "r00369258"

$program
[1] "Bachelor business adminstration"

$exchange
[1] FALSE

$course
[1] "Data and programming skills"                     
[2] "Strategic management"                            
[3] "Macro-economics and economic policy"             
[4] "Economic sociology"                              
[5] "Introduction to methods for operational research"

$hours
[1] 52 26 52 26 26

$ects
[1] 6 3 6 3 3

To extract the names in the list, you can use names().

names(stud1)
[1] "name"     "number"   "program"  "exchange" "course"   "hours"    "ects"    
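
The names() function can also be used on the left-hand side of an assignment, which is a convenient way to name a list after it has been created. A minimal sketch with a hypothetical two-element list (stud_demo):

```r
# an unnamed list: names() returns NULL
stud_demo <- list("Alice Wonderland", c(6, 3, 6, 3, 3))
names(stud_demo)

# assign one name per component
names(stud_demo) <- c("name", "ects")
names(stud_demo)
str(stud_demo)
```

The number of names should match the number of components in the list.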

Some functions in R return a named list. For instance, the attributes() function shows the attributes of a vector or a matrix as a named list:

attributes(matrix(c(10, 20, 30, 40), 2, 2, dimnames = list(c("obs1", "obs2"), c("var1", "var2"))))
$dim
[1] 2 2

$dimnames
$dimnames[[1]]
[1] "obs1" "obs2"

$dimnames[[2]]
[1] "var1" "var2"

Again, note that attributes() returns a list: it wouldn’t be possible to show this result otherwise, as it mixes character and numeric values.

4.4.3 Nested lists

Inside a list, you can have other lists. In that case, the lists are nested. Let’s add two new students and store their data in the lists stud2 and stud3:

student <- "Bart Vader"
studentnr <- "r00362958"
program <- "Bachelor business adminstration"
exchange <- FALSE
course <- c("Data and programming skills", "Strategic management", "Macro-economics and economic policy", "Financial statement analysis", "Entrepreneurship and business planning")
ects <- c(6, 3, 6, 6, 3)
hours <- c(52, 26, 52, 52, 26)

stud2 <- list(name = student, 
             number = studentnr,
             program = program,
             exchange = exchange,
             course  = course,
             hours = hours, 
             ects = ects)

student <- "Clark Kent"
studentnr <- "r00362478"
program <- "Bachelor business adminstration"
exchange <- TRUE
course <- c("Macro-economics and economic policy", "Economic sociology", "Entrepreneurship and business planning", "Financial accouing B", "Mathematics for business B")
ects <- c(6, 3, 3, 3, 3)
hours <- c(52, 26, 26, 26, 26)

stud3 <- list(name = student, 
             number = studentnr,
             program = program,
             exchange = exchange,
             course  = course,
             hours = hours, 
             ects = ects)

Using list(), we can add these three students to one list and give each list a name:

allstud <- list(student1 = stud1, 
                 student2 = stud2, 
                 student3 = stud3)

Note that the three lists here include the same components. However, this is not necessary. A nested list can include lists with various components.

From the structure of the list

str(allstud)
List of 3
 $ student1:List of 7
  ..$ name    : chr "Alice Wonderland"
  ..$ number  : chr "r00369258"
  ..$ program : chr "Bachelor business adminstration"
  ..$ exchange: logi FALSE
  ..$ course  : chr [1:5] "Data and programming skills" "Strategic management" "Macro-economics and economic policy" "Economic sociology" ...
  ..$ hours   : num [1:5] 52 26 52 26 26
  ..$ ects    : num [1:5] 6 3 6 3 3
 $ student2:List of 7
  ..$ name    : chr "Bart Vader"
  ..$ number  : chr "r00362958"
  ..$ program : chr "Bachelor business adminstration"
  ..$ exchange: logi FALSE
  ..$ course  : chr [1:5] "Data and programming skills" "Strategic management" "Macro-economics and economic policy" "Financial statement analysis" ...
  ..$ hours   : num [1:5] 52 26 52 52 26
  ..$ ects    : num [1:5] 6 3 6 6 3
 $ student3:List of 7
  ..$ name    : chr "Clark Kent"
  ..$ number  : chr "r00362478"
  ..$ program : chr "Bachelor business adminstration"
  ..$ exchange: logi TRUE
  ..$ course  : chr [1:5] "Macro-economics and economic policy" "Economic sociology" "Entrepreneurship and business planning" "Financial accouing B" ...
  ..$ hours   : num [1:5] 52 26 26 26 26
  ..$ ects    : num [1:5] 6 3 3 3 3

You can now see that this list has 3 levels: the first includes the three lists, one for every student. The second level shows the components of the list per student and the third level includes the individual values for each list component. These last two levels coincide with the components of stud1, stud2 and stud3. You could add more lists. For instance, you could define a list with course data including the course, hours and ects vectors and store these vectors in a separate list. In that case, you would add a level to the hierarchy.

Here, the function names() returns the names of the highest hierarchy:

names(allstud)
[1] "student1" "student2" "student3"

and length() shows the number of components in the highest hierarchy:

length(allstud)
[1] 3
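
When all lists inside a nested list share the same components, you can collect one component across all of them. The sketch below uses sapply() with a small anonymous function; both the reduced list allstud_demo and this approach are illustrations, not the only way to do this:

```r
# a reduced nested list with two students
allstud_demo <- list(
  student1 = list(name = "Alice Wonderland", ects = c(6, 3, 6, 3, 3)),
  student2 = list(name = "Bart Vader", ects = c(6, 3, 6, 6, 3))
)

# collect the name of every student in a named character vector
stud_names <- sapply(allstud_demo, function(s) s$name)
stud_names

# total number of ECTS credits per student
total_ects <- sapply(allstud_demo, function(s) sum(s$ects))
total_ects
```

The names of the highest level (student1, student2) are carried over to the result, so each value can be traced back to its list.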

The output of some plotting functions is also a list. Recall the plots with the random draws from various distributions, e.g.

hist(v_norm <- rnorm(n = 100, mean = 0, sd = 1), 
     probability = TRUE, 
     col = "lightblue", 
     border = "white", 
     xlab = "Value", 
     main = "Normal")

You can assign this plot to an object, plot_norm:

plot_norm <- hist(v_norm <- rnorm(n = 100, mean = 0, sd = 1), 
                  probability = TRUE, 
                  col = "lightblue", 
                  border = "white", 
                  xlab = "Value", 
                  main = "Normal")

Now check the type of this plot

typeof(plot_norm)
[1] "list"

As you can see, the object returned by hist() is stored as a list. In other words, if you store such objects in a list, you are using nested lists.
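
To look inside such an object without drawing anything, hist() can be called with plot = FALSE. The following sketch (the seed and the name h_demo are arbitrary choices for this example) shows that the result is a list of class "histogram" whose components can be inspected like those of any other list:

```r
set.seed(1)  # only to make the random draws reproducible

# compute the histogram without plotting it
h_demo <- hist(rnorm(n = 100, mean = 0, sd = 1), plot = FALSE)

class(h_demo)   # "histogram"
typeof(h_demo)  # "list"
names(h_demo)   # the components stored in the list

# the counts per bin add up to the number of draws
sum(h_demo$counts)
```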

4.4.4 Unlist

The function unlist(x, recursive = TRUE, use.names = TRUE) simplifies the list structure and returns all the individual components of the list as a vector. With the default recursive = TRUE, the function is applied to all components of the list: with nested lists, this unlists all lists within the list. The last option, use.names = TRUE by default, preserves the names. To see what this function does, let’s apply it to stud1. As we don’t have any lists within stud1, the option recursive has no effect here. Unlisting stud1 returns:

unlist(stud1)
                                              name 
                                "Alice Wonderland" 
                                            number 
                                       "r00369258" 
                                           program 
                 "Bachelor business adminstration" 
                                          exchange 
                                           "FALSE" 
                                           course1 
                     "Data and programming skills" 
                                           course2 
                            "Strategic management" 
                                           course3 
             "Macro-economics and economic policy" 
                                           course4 
                              "Economic sociology" 
                                           course5 
"Introduction to methods for operational research" 
                                            hours1 
                                              "52" 
                                            hours2 
                                              "26" 
                                            hours3 
                                              "52" 
                                            hours4 
                                              "26" 
                                            hours5 
                                              "26" 
                                             ects1 
                                               "6" 
                                             ects2 
                                               "3" 
                                             ects3 
                                               "6" 
                                             ects4 
                                               "3" 
                                             ects5 
                                               "3" 

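The output shows all individual components. Note that e.g. course, which is a character vector, is simplified to its individual elements. R labels these elements as e.g. course1, course2, … . Likewise, hours, a numeric vector, is shown as individual elements with names hours1, hours2, … . Note also that the logical and numeric values are now shown between quotation marks: because the result is a vector, R coerces all values to character.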
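
Coercion only happens when the list mixes types. As a small sketch (num_list is a hypothetical list with numeric components only), unlisting keeps the values numeric, and use.names = FALSE drops the generated names:

```r
# a list with numeric components only
num_list <- list(hours = c(52, 26), ects = c(6, 3))

# the result is a named numeric vector
flat_num <- unlist(num_list)
typeof(flat_num)  # "double"
names(flat_num)   # "hours1" "hours2" "ects1" "ects2"

# without names
flat_plain <- unlist(num_list, use.names = FALSE)
flat_plain        # 52 26 6 3
```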

Applied to allstud, a nested list, and using recursive = FALSE, unlist() returns the individual components of the three lists as one long list. The names of the highest hierarchy in allstud are used to construct the names. Using unlist(allstud, recursive = FALSE) returns:

unlist(allstud, recursive = FALSE)
$student1.name
[1] "Alice Wonderland"

$student1.number
[1] "r00369258"

$student1.program
[1] "Bachelor business adminstration"

$student1.exchange
[1] FALSE

$student1.course
[1] "Data and programming skills"                     
[2] "Strategic management"                            
[3] "Macro-economics and economic policy"             
[4] "Economic sociology"                              
[5] "Introduction to methods for operational research"

$student1.hours
[1] 52 26 52 26 26

$student1.ects
[1] 6 3 6 3 3

$student2.name
[1] "Bart Vader"

$student2.number
[1] "r00362958"

$student2.program
[1] "Bachelor business adminstration"

$student2.exchange
[1] FALSE

$student2.course
[1] "Data and programming skills"           
[2] "Strategic management"                  
[3] "Macro-economics and economic policy"   
[4] "Financial statement analysis"          
[5] "Entrepreneurship and business planning"

$student2.hours
[1] 52 26 52 52 26

$student2.ects
[1] 6 3 6 6 3

$student3.name
[1] "Clark Kent"

$student3.number
[1] "r00362478"

$student3.program
[1] "Bachelor business adminstration"

$student3.exchange
[1] TRUE

$student3.course
[1] "Macro-economics and economic policy"   
[2] "Economic sociology"                    
[3] "Entrepreneurship and business planning"
[4] "Financial accouing B"                  
[5] "Mathematics for business B"            

$student3.hours
[1] 52 26 26 26 26

$student3.ects
[1] 6 3 3 3 3

You can see that this is a list if you use is.list():

is.list(unlist(allstud, recursive = FALSE))
[1] TRUE

Note that all names include a dot “.” to separate the name of the list (e.g. student1) and the name of the component (e.g. name). Using names() you can select the names:

names(unlist(allstud, recursive = FALSE))
 [1] "student1.name"     "student1.number"   "student1.program" 
 [4] "student1.exchange" "student1.course"   "student1.hours"   
 [7] "student1.ects"     "student2.name"     "student2.number"  
[10] "student2.program"  "student2.exchange" "student2.course"  
[13] "student2.hours"    "student2.ects"     "student3.name"    
[16] "student3.number"   "student3.program"  "student3.exchange"
[19] "student3.course"   "student3.hours"    "student3.ects"    

In case of nested lists, the default recursive = TRUE will simplify every list in the nested list. In other words, the output will be similar to the one for unlisting unnested lists.
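
A reduced example may help to see the difference. In the sketch below, nested_demo is a hypothetical nested list with two students; with the default recursive = TRUE the result is a named character vector rather than a list:

```r
# a small nested list: two students with a name and two ECTS values each
nested_demo <- list(
  s1 = list(name = "A", ects = c(6, 3)),
  s2 = list(name = "B", ects = c(3, 3))
)

# default: recursive = TRUE flattens every level
flat_all <- unlist(nested_demo)
is.list(flat_all)  # FALSE
typeof(flat_all)   # "character"
names(flat_all)    # "s1.name" "s1.ects1" "s1.ects2" "s2.name" "s2.ects1" "s2.ects2"

# recursive = FALSE removes only the highest level and returns a list
flat_one <- unlist(nested_demo, recursive = FALSE)
is.list(flat_one)  # TRUE
names(flat_one)    # "s1.name" "s1.ects" "s2.name" "s2.ects"
```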

4.4.5 Subsetting a list

4.4.5.1 Subsetting non-nested lists

To subset a list, you can use index positions with both the single square brackets operator [] and the double square brackets operator [[]]. Let’s start with the first, [], and extract the first element of stud1, the list with the data on the first student, Alice Wonderland:

stud1[1]
$name
[1] "Alice Wonderland"

As you can see, this operator returns the first component of stud1 and does so as a list. In other words, [] preserves the structure of the data. You can see this from the output (which refers to the $name) as well as from the class of the output:

class(stud1[1])
[1] "list"

The double square brackets [[]] are a simplifying operator. They simplify the result as much as possible, e.g. to a numeric vector, a character vector, a logical value, … . For instance, let’s use [[]] to extract the first element of stud1:

stud1[[1]]
[1] "Alice Wonderland"

Recall that the preserving subsetting operator returned a list. Here, R simplifies to a character vector:

class(stud1[[1]])
[1] "character"

Let’s now subset the sixth element of stud1, the hours for each course. Using the single square brackets, R returns a list:

stud1[6]
$hours
[1] 52 26 52 26 26
class(stud1[6])
[1] "list"

while the simplifying operator returns a numeric vector:

stud1[[6]]
[1] 52 26 52 26 26
is.vector(stud1[[6]])
[1] TRUE
class(stud1[[6]])
[1] "numeric"

To subset this vector, you start from the simplifying operator. As this operator creates a vector, you can now use the subsetting rules for a vector. Here, the vector you subset is stud1[[6]]. To subset the first element, you add [1]:

stud1[[6]][1]
[1] 52

You can now use all subsetting rules for vectors, e.g.

  • a range:
stud1[[6]][1:4]
[1] 52 26 52 26
  • all but the first:
stud1[[6]][-1]
[1] 26 52 26 26
  • a logical condition:
stud1[[6]][stud1[[6]] > 30]
[1] 52 52

If the list is named, you can also use the names, between quotation marks, in the preserving subsetting operator [] or the simplifying operator [[]]. The first returns a list, the second simplifies the output. To extract the name of the student in stud2 and return a list, you can use:

stud2["name"]
$name
[1] "Bart Vader"

Simplifying this result can be done using the simplifying subsetting operator [[]]:

stud2[["name"]]
[1] "Bart Vader"

You can extract the value of a list and simplify the result also in a second way: you add the name of the component after the name of the list separated by the $ subsetting operator: name_of_list$name_of_element. Doing so, R simplifies the results. For instance, to subset the component ects from the list stud2, you can use:

stud2$ects
[1] 6 3 6 6 3

Here, the output is simplified to a vector. In other words, stud2$ects returns the same output as stud2[["ects"]]. You can now use all subsetting methods for a vector.

stud2$ects[3]
[1] 6

Subsetting within a component of a list is determined by the class of that component. In the examples, R simplified to a numeric vector. If one of the components of the list were a matrix, you would use the subsetting rules for a matrix.
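
As a minimal sketch of the matrix case, assume a hypothetical list grades with a matrix component scores (neither name comes from the student data):

```r
# Hypothetical list with a character component and a matrix component
grades <- list(
  student = "Alice",
  scores  = matrix(1:6, nrow = 2,
                   dimnames = list(c("exam1", "exam2"),
                                   c("math", "stat", "econ")))
)

# $ (or [[ ]]) returns the matrix itself, so matrix subsetting applies
grades$scores[2, "stat"]       # row 2, column "stat"
grades[["scores"]][, "math"]   # the whole column "math"
```

The first part of each expression selects the component from the list. The part in the trailing square brackets follows the [row, column] rules for matrices.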

As was the case with vectors, matrices or arrays, a negative index position extracts all but the element in that position. For instance, extracting all elements of stud2 except the first can be done using:

stud2[-1]
$number
[1] "r00362958"

$program
[1] "Bachelor business adminstration"

$exchange
[1] FALSE

$course
[1] "Data and programming skills"           
[2] "Strategic management"                  
[3] "Macro-economics and economic policy"   
[4] "Financial statement analysis"          
[5] "Entrepreneurship and business planning"

$hours
[1] 52 26 52 52 26

$ects
[1] 6 3 6 6 3

To extract multiple components, you combine their positions via c(). For instance, to extract the first and third element of the list stud3, you add these to the preserving operator []:

stud3[c(1, 3)]
$name
[1] "Clark Kent"

$program
[1] "Bachelor business adminstration"

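Note that in this case the simplifying operator doesn’t work: [[]] extracts a single component, and a vector of positions is read as recursive indexing (stud3[[c(1, 3)]] means stud3[[1]][[3]]) rather than as a request for several components. With named elements, you can also use the names of these elements: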

stud3[c("name", "number")]
$name
[1] "Clark Kent"

$number
[1] "r00362478"
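
Coming back to the remark that the simplifying operator cannot be used with several positions, the following sketch shows what [[]] does with a vector of positions. The list s is hypothetical and only serves as an illustration.

```r
# Hypothetical list, used only for illustration
s <- list(name = "Clark", hours = c(52, 26, 26), exchange = TRUE)

# [ ] accepts several positions and returns a (shorter) list
s[c(1, 3)]

# [[ ]] with a vector indexes recursively: s[[c(2, 3)]] is s[[2]][[3]]
s[[c(2, 3)]]

# so asking [[ ]] for "components 1 and 3" fails instead of returning both;
# tryCatch() turns the error into the string "error" to keep the script running
res <- tryCatch(s[[c(1, 3)]], error = function(e) "error")
res
```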

Using negative index positions, you can extract all but the elements in those positions. For instance, extracting all elements from stud3 except the first and third:

stud3[c(-1, -3)]
$number
[1] "r00362478"

$exchange
[1] TRUE

$course
[1] "Macro-economics and economic policy"   
[2] "Economic sociology"                    
[3] "Entrepreneurship and business planning"
[4] "Financial accouing B"                  
[5] "Mathematics for business B"            

$hours
[1] 52 26 26 26 26

$ects
[1] 6 3 3 3 3

You can also use logical values to subset a list. For instance:

stud1[c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)]
$name
[1] "Alice Wonderland"

$exchange
[1] FALSE

$ects
[1] 6 3 6 3 3

This allows you to extract e.g. components of a list using patterns in a name. For instance, extracting a component that includes the pattern “ects” can be done using grepl() where this function searches in the vector names(stud1) for a match with the pattern “ects”:

stud1[grepl(pattern = "ects", names(stud1))]
$ects
[1] 6 3 6 3 3

4.4.5.2 Subsetting nested lists

Recall that nested lists are lists that include other lists as their elements. How do you subset a list with lists? Let’s first use index positions. Using [] returns a list. For instance,

allstud[1]
$student1
$student1$name
[1] "Alice Wonderland"

$student1$number
[1] "r00369258"

$student1$program
[1] "Bachelor business adminstration"

$student1$exchange
[1] FALSE

$student1$course
[1] "Data and programming skills"                     
[2] "Strategic management"                            
[3] "Macro-economics and economic policy"             
[4] "Economic sociology"                              
[5] "Introduction to methods for operational research"

$student1$hours
[1] 52 26 52 26 26

$student1$ects
[1] 6 3 6 3 3

returns the first list, stud1, but the output keeps all references to the name of that list within allstud (student1). Simplifying using the [[]] operator removes that outer level, e.g. the reference to $student1, but the result is still a list.

allstud[[1]]
$name
[1] "Alice Wonderland"

$number
[1] "r00369258"

$program
[1] "Bachelor business adminstration"

$exchange
[1] FALSE

$course
[1] "Data and programming skills"                     
[2] "Strategic management"                            
[3] "Macro-economics and economic policy"             
[4] "Economic sociology"                              
[5] "Introduction to methods for operational research"

$hours
[1] 52 26 52 26 26

$ects
[1] 6 3 6 3 3

Note that this shouldn’t be surprising: stud1 is itself a list, so [[]] cannot simplify any further than the list that was nested in allstud. As an alternative to the index position, you can also refer to the name of the list you want to extract. Adding that name to the preserving subsetting operator will extract the list while preserving the structure of the list. For instance, extracting the second list:

allstud["student2"]
$student2
$student2$name
[1] "Bart Vader"

$student2$number
[1] "r00362958"

$student2$program
[1] "Bachelor business adminstration"

$student2$exchange
[1] FALSE

$student2$course
[1] "Data and programming skills"           
[2] "Strategic management"                  
[3] "Macro-economics and economic policy"   
[4] "Financial statement analysis"          
[5] "Entrepreneurship and business planning"

$student2$hours
[1] 52 26 52 52 26

$student2$ects
[1] 6 3 6 6 3

Doing so with the simplifying operator returns the original list:

allstud[["student2"]]
$name
[1] "Bart Vader"

$number
[1] "r00362958"

$program
[1] "Bachelor business adminstration"

$exchange
[1] FALSE

$course
[1] "Data and programming skills"           
[2] "Strategic management"                  
[3] "Macro-economics and economic policy"   
[4] "Financial statement analysis"          
[5] "Entrepreneurship and business planning"

$hours
[1] 52 26 52 52 26

$ects
[1] 6 3 6 6 3

Let’s now move one step lower in the hierarchy. If you want to extract e.g. the name of student1, you first extract the first list using the simplifying operator. Doing so, you extract the list stud1. Adding [1] extracts the first component of the list stud1 as a list

allstud[[1]][1]
$name
[1] "Alice Wonderland"

while using [[1]] simplifies the output

allstud[[1]][[1]]
[1] "Alice Wonderland"

A second way uses the names of the elements. For instance, extracting the name of the student in student1 using the preserving operator to return a list:

allstud[["student1"]]["name"]
$name
[1] "Alice Wonderland"

or the simplifying operator to return a character variable:

allstud[["student1"]][["name"]]
[1] "Alice Wonderland"

Third, recall that the $ operator acts as a simplifying operator. In other words, you can extract the first list using allstud$student1. You can now extract the elements of that list using the preserving operator [] or the simplifying operator [[]], both with index positions and names, as well as the $ operator. For instance, to extract the values in ects:

  • preserving the structure:
allstud$student1[7]
$ects
[1] 6 3 6 3 3
allstud$student1["ects"]
$ects
[1] 6 3 6 3 3
  • simplifying the structure using [[]]:
allstud$student1[[7]]
[1] 6 3 6 3 3
allstud$student1[["ects"]]
[1] 6 3 6 3 3
  • simplifying the structure using $:
allstud$student1$ects
[1] 6 3 6 3 3

Note that you can mix index and name-based subsetting. Recall that the [[]] operator returns the nested list, but removes all references to the name of that list (e.g. student1). You can then apply the $ operator to that list. For instance

allstud[[1]]$ects
[1] 6 3 6 3 3

extracts the number of credits for student1.
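
The different routes to the same component can be compared directly. The sketch below uses a hypothetical nested list with the same shape as allstud; the names and values are made up for the illustration.

```r
# Hypothetical nested list with the same shape as allstud
nested <- list(
  student1 = list(name = "Ann", ects = c(6, 3, 6)),
  student2 = list(name = "Bob", ects = c(3, 3, 6))
)

# Four routes to the same numeric vector
by_index  <- nested[[1]][[2]]
by_name   <- nested[["student1"]][["ects"]]
by_dollar <- nested$student1$ects
mixed     <- nested[[1]]$ects

identical(by_index, by_name)
identical(by_name, by_dollar)
identical(by_dollar, mixed)

# Replacing the last step with [ ] keeps a list around the vector
class(nested$student1["ects"])
```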

4.4.5.3 Extracting components across many lists in a nested list

allstud includes data for all students, where each student’s data is stored in a separate list. In the previous section, we subsetted data for an individual student. But what if we need similar data for each student in the list? To do that, you can use the Filter() function, or use unlist() to remove the highest list level and extract the information from the lists at the second level.

Using the Filter(f, x) function (note the uppercase F), you can filter nested lists. The arguments of this function are f, a function that returns a single logical value for each component, and x, a vector or list. The function keeps the components of x for which f returns TRUE. Here, the argument x of Filter() is the nested list allstud. In that nested list, each student’s list holds vectors such as ects, hours or course. We can use these to extract all students that meet a condition. This condition is defined by f. For instance, suppose that we want to extract all students whose courses add up to more than 21 ECTS. To calculate the total number of ECTS, we use sum(x$ects). Inside f, the argument (also called x) stands for one component of allstud at a time, i.e. for one student’s list. In other words, x$ects is shorthand for allstud$studenti$ects. The condition can be written as sum(x$ects) > 21. We now also have the function f: function(x) sum(x$ects) > 21. Using this in Filter():

Filter(function(x) sum(x$ects) > 21, allstud)
$student2
$student2$name
[1] "Bart Vader"

$student2$number
[1] "r00362958"

$student2$program
[1] "Bachelor business adminstration"

$student2$exchange
[1] FALSE

$student2$course
[1] "Data and programming skills"           
[2] "Strategic management"                  
[3] "Macro-economics and economic policy"   
[4] "Financial statement analysis"          
[5] "Entrepreneurship and business planning"

$student2$hours
[1] 52 26 52 52 26

$student2$ects
[1] 6 3 6 6 3

This function returns the list of the second student. This is the only student whose total ECTS is higher than 21. Extracting all exchange students (exchange is TRUE) can be done using:

Filter(function(x) x$exchange == TRUE, allstud)
$student3
$student3$name
[1] "Clark Kent"

$student3$number
[1] "r00362478"

$student3$program
[1] "Bachelor business adminstration"

$student3$exchange
[1] TRUE

$student3$course
[1] "Macro-economics and economic policy"   
[2] "Economic sociology"                    
[3] "Entrepreneurship and business planning"
[4] "Financial accouing B"                  
[5] "Mathematics for business B"            

$student3$hours
[1] 52 26 26 26 26

$student3$ects
[1] 6 3 3 3 3
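
Filter() is closely related to logical subsetting. The sketch below, on a hypothetical nested list (names, values and the threshold of 15 are made up), shows both forms side by side and how two conditions can be combined inside the function.

```r
# Hypothetical nested list, used only for illustration
nested <- list(
  student1 = list(name = "Ann", exchange = FALSE, ects = c(6, 3, 6)),
  student2 = list(name = "Bob", exchange = TRUE,  ects = c(6, 6, 6)),
  student3 = list(name = "Cem", exchange = TRUE,  ects = c(3, 3, 3))
)

# Filter() keeps the components for which the function returns TRUE
heavy <- Filter(function(x) sum(x$ects) > 15, nested)
names(heavy)

# The same selection written as logical subsetting
keep <- sapply(nested, function(x) sum(x$ects) > 15)
names(nested[keep])

# Conditions can be combined inside the function
both <- Filter(function(x) x$exchange && sum(x$ects) > 15, nested)
names(both)
```

Filter() saves the intermediate logical vector. The explicit version is useful when you want to inspect or reuse that vector.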

You can also first use unlist() to remove the highest level of the nested list. Recall that unlist() with recursive = FALSE removes the upper hierarchy of a nested list and that you can collect the names for each of the components in the remaining list. Using these names, you can now extract components. To see how, let’s first store the output of unlist() in a separate list:

unl_allstud <- unlist(allstud, recursive = FALSE)

and extract the names

unl_allstud_names <- names(unl_allstud)

Let’s now try to extract all courses for every student. This is where regular expressions enter. The courses are stored in e.g. student1.course or student2.course, i.e. a pattern “student”, a digit, “.” and “course”. In terms of a regular expression, this is the pattern "student\\d.course". Note that an unescaped dot matches any single character, which includes the literal dot in these names; to match only a literal dot, you would write \\. instead. Recall that grepl() returns the logical value TRUE if a pattern is matched. In other words, grepl(pattern = "student\\d.course", unl_allstud_names) will return TRUE for every element of the names vector that matches, such as student1.course or student3.course. We can now use this vector to subset unl_allstud:

unl_allstud[grepl(pattern = "student\\d.course", unl_allstud_names)]
$student1.course
[1] "Data and programming skills"                     
[2] "Strategic management"                            
[3] "Macro-economics and economic policy"             
[4] "Economic sociology"                              
[5] "Introduction to methods for operational research"

$student2.course
[1] "Data and programming skills"           
[2] "Strategic management"                  
[3] "Macro-economics and economic policy"   
[4] "Financial statement analysis"          
[5] "Entrepreneurship and business planning"

$student3.course
[1] "Macro-economics and economic policy"   
[2] "Economic sociology"                    
[3] "Entrepreneurship and business planning"
[4] "Financial accouing B"                  
[5] "Mathematics for business B"            

As you can see, we now have a list which includes the courses for each student. If you assign this result to a list, e.g. courses, you can now subset these courses and find students who, e.g., took Economic sociology.

You can write this code more compactly:

unlist(allstud, recursive = FALSE)[grepl(pattern = "student\\d.course", names(unlist(allstud, recursive = FALSE)))]
$student1.course
[1] "Data and programming skills"                     
[2] "Strategic management"                            
[3] "Macro-economics and economic policy"             
[4] "Economic sociology"                              
[5] "Introduction to methods for operational research"

$student2.course
[1] "Data and programming skills"           
[2] "Strategic management"                  
[3] "Macro-economics and economic policy"   
[4] "Financial statement analysis"          
[5] "Entrepreneurship and business planning"

$student3.course
[1] "Macro-economics and economic policy"   
[2] "Economic sociology"                    
[3] "Entrepreneurship and business planning"
[4] "Financial accouing B"                  
[5] "Mathematics for business B"            
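
The earlier remark about finding students who took a given course can be sketched as follows. The list courses below is a shortened, hypothetical stand-in for the output above, so the example runs on its own.

```r
# Hypothetical list of course vectors, shaped like the output above
courses <- list(
  student1.course = c("Strategic management", "Economic sociology"),
  student2.course = c("Strategic management", "Financial statement analysis"),
  student3.course = c("Economic sociology", "Mathematics for business B")
)

# %in% tests membership; sapply() runs the test once per student
took_soc <- sapply(courses, function(x) "Economic sociology" %in% x)
took_soc

# Names of the matching components
names(courses)[took_soc]
```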

4.4.6 Changing the elements in a list

There are three ways to change a list: first, you can change one of its components; second, you can add a new component; and third, you can remove a component.

4.4.6.1 Unnested lists

4.4.6.1.1 Changing a component in a list

Changing one of the components of a non-nested list is no different from changing one of the elements of a vector or matrix. Subsetting this component and reassigning its value will do just that. For instance, changing the value FALSE to TRUE in the exchange component of stud1:

stud1[4] <- TRUE

As an alternative, you can also use the other subsetting operators, [[]] or $. For instance

stud1$exchange <- FALSE

changes this value back to FALSE.

To change a value within a component that is a vector, matrix or array, you use a similar approach. For instance, changing the first element in the hours vector for student 1 from 52 to 26 uses the fact that stud1$hours is a vector. Changing the first element of this vector:

stud1$hours[1] <- 26

Note that here, you can use any approach we have covered for the other data structures. In other words, you can increase the number of elements in a vector (e.g. by adding them via append() or via c()), add columns and rows to a matrix using rbind() or cbind() or change the number of matrices in an array.

4.4.6.1.2 Adding components to a list

Suppose that you would like to add the total number of hours to the lists stud1, stud2 and stud3. The first approach adds a component by assigning its value to stud1[8]. Recall that stud1 includes 7 components, so a new component will be the eighth. You can write this more generally using length(stud1), which returns the number of components in stud1: assigning to position length(stud1) + 1 creates a new component. This is safer than hard-coding a number such as 8. Especially if you have long and complex lists, you could easily overwrite an existing component. To add the total hours, we use the fact that stud1$hours is a vector. Using sum(stud1$hours) allows us to add the total number of hours:

stud1[length(stud1) + 1] <- sum(stud1$hours)
stud1[8]
[[1]]
[1] 156

Note that stud1[8] is not named. To fix this, we can add a name, total. names(stud1) is a character vector. We can set the eighth element of that vector using:

names(stud1)[8] <- "total"
stud1$total
[1] 156

To name the component, you could again use length(stud1). However, in this case, note that you want to change the last component and not the last plus one.

The second way creates a named component. To do so, we add the name of that component, total, to stud2 using the $ operator. We can assign the total number of hours to that named component:

stud2$total <- sum(stud2$hours)
stud2$total
[1] 208

The third approach uses the c() function. Here, we add the component “total” by combining it with the existing components of stud3 and assigning this new list to stud3:

stud3 <- c(stud3, "total" = sum(stud3$hours))
stud3$total
[1] 156

The fourth approach uses append(). Here, you include the list as well as the value you want to add in the function arguments: append(list, value). Using this function, you can also set the position of the new component with the after argument.
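
A minimal sketch of this fourth approach, on a hypothetical list s (not one of the student lists), could look as follows. Wrapping the new value in list() with a name is a choice made here so that the component is named in the same step.

```r
# Hypothetical list, used only for illustration
s <- list(name = "Ann", hours = c(52, 26, 26), ects = c(6, 3, 3))

# Wrap the new value in list() so that it is added as ONE named component
s_end <- append(s, list(total = sum(s$hours)))
names(s_end)

# after = 1 inserts the new component right after the first one
s_mid <- append(s, list(total = sum(s$hours)), after = 1)
names(s_mid)
s_mid$total
```

Like c(), append() returns a new list. The original list only changes if you assign the result back to it.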

If you want to add a vector, matrix or array as a new component of the list using the first, third or fourth approach, you need to tell R that you want to include that structure as a single component and not its values as individual components. To do so, you wrap that structure in a list() call. For instance, to add a new vector semester with values c(1, 2):

stud1[length(stud1) + 1] <- list(c(1, 2))
names(stud1)[length(stud1)] <- "semester"

stud3 <- c(stud3, "semester" = list(c(1, 2)))

You can now check that this component was added as a vector:

stud1$semester
[1] 1 2
stud3$semester
[1] 1 2

Let’s see what would happen if you didn’t include the list() statement. To do so, we’ll use a copy of stud1:

stud1_copy <- stud1
stud1_copy <- c(stud1_copy, "test" = c(200, 300))
stud1_copy$test1
[1] 200
stud1_copy$test2
[1] 300
rm(stud1_copy)

Here, you can see that R added both values in c(200, 300) as individual elements to components it named test1 and test2. In other words, R didn’t add the vector, it added the values.

Using the second approach to add a new component doesn’t require the list() statement:

stud2$semester <- c(1, 2)
stud2$semester
[1] 1 2

Here, you are explicitly telling R that the values c(1, 2) have to be stored in a single component, semester, of the list stud2.

4.4.6.1.3 Removing components from a list

The first approach to removing components from a list uses negative index numbers. Recall that a negative index selects everything except the components in those positions. Using this approach, you assign the subsetted list to the same list name. Doing so will give you a new list with the same name, but without the removed component. For instance, to remove “total” from stud3:

stud3 <- stud3[-8]
stud3
$name
[1] "Clark Kent"

$number
[1] "r00362478"

$program
[1] "Bachelor business adminstration"

$exchange
[1] TRUE

$course
[1] "Macro-economics and economic policy"   
[2] "Economic sociology"                    
[3] "Entrepreneurship and business planning"
[4] "Financial accouing B"                  
[5] "Mathematics for business B"            

$hours
[1] 52 26 26 26 26

$ects
[1] 6 3 3 3 3

$semester
[1] 1 2

A second way to remove components is to assign NULL to them. For instance, to remove the total number of hours for stud1 and stud2:

stud1$total <- NULL
stud2[8] <- NULL

You can verify that both these lists lost their component total, e.g. for stud1:

str(stud1)
List of 8
 $ name    : chr "Alice Wonderland"
 $ number  : chr "r00369258"
 $ program : chr "Bachelor business adminstration"
 $ exchange: logi FALSE
 $ course  : chr [1:5] "Data and programming skills" "Strategic management" "Macro-economics and economic policy" "Economic sociology" ...
 $ hours   : num [1:5] 26 26 52 26 26
 $ ects    : num [1:5] 6 3 6 3 3
 $ semester: num [1:2] 1 2
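
Because assigning NULL removes a component, storing NULL as the value of a component needs a different form. The sketch below uses a hypothetical list s to contrast the two cases.

```r
# Hypothetical list, used only for illustration
s <- list(name = "Ann", total = 104, semester = c(1, 2))

# Assigning NULL with $ (or [[ ]]) removes the component
s$total <- NULL
names(s)

# To KEEP a component whose value is NULL, assign list(NULL) with [ ]
s["semester"] <- list(NULL)
names(s)
is.null(s$semester)
length(s)
```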

4.4.6.2 Nested lists

Let’s add a new student to allstud. The data for this student are collected in a list, stud4. This list will then be added to allstud. The fourth student:

student <- "Lois Lane"
studentnr <- "r00252478"
program <- "Bachelor business adminstration"
exchange <- FALSE
course <- c("Macro-economics and economic policy", "Economic sociology", "Entrepreneurship and business planning", "Financial accouing B", "Economics of the single market")
ects <- c(6, 3, 3, 3, 6)
hours <- c(52, 26, 26, 26, 52)

stud4 <- list(name = student, 
             number = studentnr,
             program = program,
             exchange = exchange,
             course  = course,
             hours = hours, 
             ects = ects)

We can now add this student to allstud. To do so, we use the append() function and add stud4 to allstud using append(allstud, list(stud4)). We add the name using names(allstud)[4] <- "student4". Doing so will add the fourth student to this list:

allstud <- append(allstud, list(stud4))
names(allstud)[4] <- "student4"
allstud$student4
$name
[1] "Lois Lane"

$number
[1] "r00252478"

$program
[1] "Bachelor business adminstration"

$exchange
[1] FALSE

$course
[1] "Macro-economics and economic policy"   
[2] "Economic sociology"                    
[3] "Entrepreneurship and business planning"
[4] "Financial accouing B"                  
[5] "Economics of the single market"        

$hours
[1] 52 26 26 26 52

$ects
[1] 6 3 3 3 6

Using the append() function also allows you to specify where the new list enters. Using the argument after = 2 for instance would add student4 after the second position.
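
The effect of the after argument, and of the list() wrapper, can be sketched on a small hypothetical nested list. The names roster and new are assumptions for this illustration. Naming the element inside list() is one way to skip the separate names() step.

```r
# Hypothetical nested list and new entry, used only for illustration
roster <- list(
  student1 = list(name = "Ann"),
  student2 = list(name = "Bob"),
  student3 = list(name = "Cem")
)
new <- list(name = "Dee")

# Naming the element inside list() avoids a separate names()<- step
roster_end <- append(roster, list(student4 = new))
names(roster_end)

# after = 2 places the new student between student2 and student3
roster_mid <- append(roster, list(student4 = new), after = 2)
names(roster_mid)

# Without the list() wrapper, the components of 'new' are spliced in one by one
names(append(roster, new))
```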

As an alternative, you can use c(allstud, list(stud4)). Doing so will add the fourth student to allstud. Here too, you will have to add the name afterwards. If you don’t want to add the name in a separate step, you can use the $ operator with the name of the new list, e.g. adding the data for student 4 as allstud$student5:

allstud$student5 <- stud4

Removing lists from a nested list follows a similar approach to the one to remove components from a list: you subset using a negative index and reassign this new list to the name of the old list or you use NULL to remove the list. For instance, to remove student5 from allstud:

allstud$student5 <- NULL
length(allstud)
[1] 4

4.4.7 lapply() and sapply()

4.4.7.1 Applying a function to list components

We already met the apply() function. This function was used to apply functions to the rows or columns of a matrix and allows you to avoid for loops. The lapply() and sapply() functions are designed to apply a function to each component of a list. lapply() returns a list. Hence, the name “l”apply: the list version of apply. sapply() simplifies the result to a vector, a matrix or an array where possible. Hence, the name “s”apply: the simplified version of lapply. Like the apply() function, both allow you to avoid loops. Most of what you do within lapply() or sapply() can be done with a loop as well. However, as with apply(), it is often more efficient, at least in terms of the code you have to write, to use these functions.

To see how these work, let’s start from a simple example: a list with 3 numeric vectors as components:

list1 <- list(vec1 = rnorm(100, 0, 1), 
              vec2 = rnorm(100, 5, 10), 
              vec3 = rnorm(100, 10, 20))

Let’s now use the lapply() function to calculate the mean of each of list1’s components. This function has a couple of arguments. First, the list to which the function will be applied. Second, the argument FUN, the function to be applied to each component of the list, followed by optional arguments for that function, e.g. na.rm = TRUE.

The function can be a base R function or a function you include in the lapply() or sapply() call. For instance, to calculate the mean of the components of list1:

lapply(list1, mean, na.rm = TRUE)
$vec1
[1] 0.0144794

$vec2
[1] 4.745767

$vec3
[1] 9.735662

Here, lapply() returns a list. sapply() takes the same arguments as lapply() and, in addition, simplify (TRUE by default) and USE.NAMES (TRUE by default). We will keep these default values. To calculate the mean for every component in the list:

sapply(list1, mean, na.rm = TRUE)
     vec1      vec2      vec3 
0.0144794 4.7457673 9.7356624 

As you can see, this function returns a vector.

Let’s see what it would take to write the same code with a loop:

result_mean <- matrix(0, 1, 3)
for (i in 1:3) {
  result_mean[1, i] <- mean(list1[[i]])
}
colnames(result_mean) <- names(list1)
result_mean
          vec1     vec2     vec3
[1,] 0.0144794 4.745767 9.735662

Using sapply() you write this for loop in one line of code: sapply(list1, mean).
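
The three routes can be compared on a smaller list. The sketch below also includes vapply(), a base R relative of sapply() that is not used elsewhere in this section: you state the expected type and length of each result via FUN.VALUE, and R stops with an error if a result does not match. The list lst and the seed are arbitrary choices for the illustration.

```r
set.seed(1)  # arbitrary seed, only to make the example reproducible
lst <- list(vec1 = rnorm(100, 0, 1), vec2 = rnorm(100, 5, 10))

# for loop: pre-allocate, fill, name
res_loop <- numeric(length(lst))
for (i in seq_along(lst)) {
  res_loop[i] <- mean(lst[[i]])
}
names(res_loop) <- names(lst)

# sapply(): the same result in one call
res_sapply <- sapply(lst, mean)

# vapply(): like sapply(), but the shape of each result is declared up front
res_vapply <- vapply(lst, mean, FUN.VALUE = numeric(1))

all.equal(res_loop, res_sapply)
all.equal(res_sapply, res_vapply)
```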

As you could with the apply() function, you can define your own functions in both lapply() and sapply(). Recall that we used apply() to rescale values: the difference between an element in a column and the column minimum, divided by the difference between the column maximum and the column minimum. Using lapply() and assigning these new values to list2:

list2 <- lapply(list1, function(x) (x - min(x))/(max(x) - min(x)))

You can now verify that all values in list2 are rescaled:

lapply(list2, range)
$vec1
[1] 0 1

$vec2
[1] 0 1

$vec3
[1] 0 1

Using sapply() and storing the values in mat1:

mat1 <- sapply(list1, function(x) (x - min(x))/(max(x) - min(x)))

You can verify this result (recall mat1 is a matrix):

apply(mat1, 2, range)
     vec1 vec2 vec3
[1,]    0    0    0
[2,]    1    1    1

Let’s revisit the first line lapply(list1, function(x) (x - min(x))/(max(x) - min(x))). Here you call lapply() to apply a function to every component of the list list1. In this case, the list’s components are vectors. The function to apply is function(x) (x - min(x))/(max(x) - min(x)). R will “loop over” every component of list1 and substitute that component for x in function(x). In other words, it applies that function to list1[[1]], then to list1[[2]] … until it reaches the end of the list. lapply() stores the result for every component in a list. sapply() applies the function in the same way, but simplifies the result, where possible, to a vector or matrix.

Note that you can have an apply() function within an lapply() function. If the components of a list are matrices and you would like to apply a function to every column of every matrix, you can use lapply(list, function(x) apply(x, 2, fun)).
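
A minimal sketch of this nesting, assuming a hypothetical list mats of two small matrices and max() as the function to apply:

```r
# Hypothetical list of matrices, used only for illustration
mats <- list(
  m1 = matrix(1:6, nrow = 2),               # columns: (1,2), (3,4), (5,6)
  m2 = matrix(c(10, 20, 30, 40), nrow = 2)  # columns: (10,20), (30,40)
)

# For every matrix x in the list, apply max() to each column (MARGIN = 2)
col_max <- lapply(mats, function(x) apply(x, 2, max))
col_max
```

The outer lapply() walks over the matrices and the inner apply() walks over the columns of each matrix. The result is a list with one vector of column maxima per matrix.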

list1 included only numeric vectors. In stud1 we had a mixture of data types. Most functions such as mean() or toupper() are only defined for a specific type of data. As a list can store many types, it is often convenient to first select the components of a list with the same type. Suppose you want to calculate the totals for all numeric vectors in stud1. First, we need to extract these vectors using a logical subsetting vector. To do so, we use the sapply() function with a function that returns TRUE if a condition is met and FALSE otherwise. To select the numeric components, we can use the is.numeric() function within sapply(). For every component of the list, this returns TRUE if that component is numeric and FALSE if it is not:

cond <- sapply(stud1, \(x) is.numeric(x))
cond
    name   number  program exchange   course    hours     ects semester 
   FALSE    FALSE    FALSE    FALSE    FALSE     TRUE     TRUE     TRUE 

Here, sapply() checks for every component in stud1 whether this component is numeric. In other words, it tests is.numeric(stud1[[1]]), is.numeric(stud1[[2]]) … until it reaches the last component. For every component, is.numeric() returns TRUE if the component is numeric and FALSE otherwise. sapply() collects these outcomes in a vector, here cond. In other words, cond is a logical vector whose elements are TRUE if a component of stud1 is numeric and FALSE otherwise. We can now use this logical vector to extract the components of stud1 that include numeric data. To do so, you can use

stud1[cond]
$hours
[1] 26 26 52 26 26

$ects
[1] 6 3 6 3 3

$semester
[1] 1 2

We now have the numeric components of stud1. Because we subsetted a list, the output is also a list. We can now use lapply() or sapply() to calculate the totals for all numeric vectors in the list stud1:

sapply(stud1[cond], function(x) sum(x))
   hours     ects semester 
     156       21        3 
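
The selection step can also be written with Filter(), which was introduced earlier for nested lists. The sketch below uses a hypothetical list s to show that both forms select the same components.

```r
# Hypothetical list with mixed types, used only for illustration
s <- list(name = "Ann", exchange = FALSE, hours = c(52, 26), ects = c(6, 3))

# Two equivalent ways to keep only the numeric components
by_cond   <- s[sapply(s, is.numeric)]
by_filter <- Filter(is.numeric, s)
names(by_cond)
names(by_filter)

# Totals for the numeric components
totals <- sapply(by_filter, sum)
totals
```

Note that is.numeric can be passed directly. Wrapping it in function(x) is.numeric(x) is not needed when the function takes the component as its only argument.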

4.4.7.2 Adding components to nested lists

With nested lists, the second level in the hierarchy is a list. Suppose now that you want to add a component to each list in the nested list. To illustrate, we’ll add the total number of hours for each student as an additional component to that list. You can subset the components of the lists on the second level within the lapply() or sapply() functions. For our example: for each student, the hours are stored in allstud$studenti$hours. lapply() applies a function to all studenti lists in allstud. Using this observation, including function(x) sum(x$hours) as a function in lapply(), R will ‘loop’ over each studenti and replace x with studenti. In doing so, R calculates the total hours for each student. lapply() returns a list:

lapply(allstud, function(x) sum(x$hours))
$student1
[1] 182

$student2
[1] 208

$student3
[1] 156

$student4
[1] 182

If you want to add these total hours to each of the students in allstud, you can use c() and add the component “totalhours” to each sublist in allstud. Here, I store the result of this procedure in a new list. In the structure of this new list, allstud_1, you’ll see that the component totalhours was added to each of the students’ lists:

allstud_1 <- lapply(allstud, function(x) c(x, "totalhours" = sum(x$hours)))
str(allstud_1)
List of 4
 $ student1:List of 8
  ..$ name      : chr "Alice Wonderland"
  ..$ number    : chr "r00369258"
  ..$ program   : chr "Bachelor business adminstration"
  ..$ exchange  : logi FALSE
  ..$ course    : chr [1:5] "Data and programming skills" "Strategic management" "Macro-economics and economic policy" "Economic sociology" ...
  ..$ hours     : num [1:5] 52 26 52 26 26
  ..$ ects      : num [1:5] 6 3 6 3 3
  ..$ totalhours: num 182
 $ student2:List of 8
  ..$ name      : chr "Bart Vader"
  ..$ number    : chr "r00362958"
  ..$ program   : chr "Bachelor business adminstration"
  ..$ exchange  : logi FALSE
  ..$ course    : chr [1:5] "Data and programming skills" "Strategic management" "Macro-economics and economic policy" "Financial statement analysis" ...
  ..$ hours     : num [1:5] 52 26 52 52 26
  ..$ ects      : num [1:5] 6 3 6 6 3
  ..$ totalhours: num 208
 $ student3:List of 8
  ..$ name      : chr "Clark Kent"
  ..$ number    : chr "r00362478"
  ..$ program   : chr "Bachelor business adminstration"
  ..$ exchange  : logi TRUE
  ..$ course    : chr [1:5] "Macro-economics and economic policy" "Economic sociology" "Entrepreneurship and business planning" "Financial accouing B" ...
  ..$ hours     : num [1:5] 52 26 26 26 26
  ..$ ects      : num [1:5] 6 3 3 3 3
  ..$ totalhours: num 156
 $ student4:List of 8
  ..$ name      : chr "Lois Lane"
  ..$ number    : chr "r00252478"
  ..$ program   : chr "Bachelor business adminstration"
  ..$ exchange  : logi FALSE
  ..$ course    : chr [1:5] "Macro-economics and economic policy" "Economic sociology" "Entrepreneurship and business planning" "Financial accouing B" ...
  ..$ hours     : num [1:5] 52 26 26 26 52
  ..$ ects      : num [1:5] 6 3 3 3 6
  ..$ totalhours: num 182
rm(allstud_1)

Recall that in case you add a data structure such as a vector, matrix or array, you need to include that structure in a list() statement, e.g. c(x, "semester" = list(c(1, 2))).
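
The effect of the list() wrapper can be seen in a minimal sketch. The one-student list stud below is a made-up stand-in for a sublist of allstud, not the original data:

```r
# Hypothetical one-student list, loosely modelled on the sublists in allstud
stud <- list(name = "Alice Wonderland", hours = c(52, 26, 52))

# Without list(): c() splices the vector in, one component per element
flat <- c(stud, "semester" = c(1, 2))
names(flat)

# With list(): the vector is kept together as a single component
kept <- c(stud, "semester" = list(c(1, 2)))
names(kept)
kept$semester
```

In the first case the list ends up with two extra components (semester1 and semester2); in the second case it has one component, semester, that holds the whole vector.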

Using sapply() returns similar results, but as a named vector and not as a list:

sapply(allstud, function(x) sum(x$hours))
student1 student2 student3 student4 
     182      208      156      182 

Note that you can not use sapply() to add a component to a list: sapply() returns a vector and not a list. In other words, you can not use it to change a list.

A second way to access the components of the lists in a nested list uses the unlist() function. Recall that we can use unlist( ,recursive = FALSE) to unlist the first level. Doing so, returns the second level as a list. Using this level, you can now use lapply() or sapply(). Let’s extract the numeric vectors from the nested list allstud and calculate their sum. In the first step, we unlist allstud with the option recursive = FALSE and store the results in a list allstud_ul:

allstud_ul <- unlist(allstud, recursive = FALSE)

You can verify that we removed the first level from the allstud list. We can now proceed along the lines of the previous example:

cond <- sapply(allstud_ul, function(x) is.numeric(x))
sapply(allstud_ul[cond], function(x) sum(x))
student1.hours  student1.ects student2.hours  student2.ects student3.hours 
           182             21            208             24            156 
 student3.ects student4.hours  student4.ects 
            18            182             21 

You can now subset this result using the familiar vector or matrix subsetting operations, e.g.

hours_ects <- sapply(allstud_ul[cond], function(x) sum(x))
hours <- hours_ects[grepl(pattern = "\\.hours$", names(hours_ects))]
hours
student1.hours student2.hours student3.hours student4.hours 
           182            208            156            182 

If you want to remove the reference to hours in the names, you can use the familiar character functions, e.g.

names(hours) <- stringr::str_extract_all(names(hours), pattern = "student\\d", simplify = TRUE)
hours
student1 student2 student3 student4 
     182      208      156      182 
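
The same renaming can also be sketched in base R, without stringr. The vector hours is retyped here so that the block runs on its own; sub() with an escaped dot is one possible choice, not the only one:

```r
# Named vector mirroring the result shown above
hours <- c(student1.hours = 182, student2.hours = 208,
           student3.hours = 156, student4.hours = 182)

# Drop a literal ".hours" suffix: the dot is escaped and $ anchors the end
names(hours) <- sub("\\.hours$", "", names(hours))
hours
```

Escaping the dot matters because an unescaped "." in a regular expression matches any character, not only a literal dot.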

4.4.7.3 Searching for a pattern across lists in a nested list

The lists that are part of a nested list include data. Sometimes you need to identify patterns that occur in some but not necessarily all sublists. For instance, in the example, the student lists that are components of the nested list allstud include data on the courses they took. Suppose that you need to know which student took a specific course, e.g. “Economic sociology”. Visual inspection shows that there are three students who took “Economic sociology”: student1, student3 and student4. To find these students, you look for a pattern in studenti$course. That pattern is "Economic sociology". We want to subset the studenti$course component in every student list. To do so within lapply(), we use x$course. The function that we apply to every student’s list subsets x$course using a logical vector that equals TRUE if “Economic sociology” is part of the vector with courses and FALSE otherwise. Here, we can use grepl(). Using this function in lapply() returns a list with one logical vector per student:

lapply(allstud, function(x) grepl(pattern = "Economic sociology", x = x$course))
$student1
[1] FALSE FALSE FALSE  TRUE FALSE

$student2
[1] FALSE FALSE FALSE FALSE FALSE

$student3
[1] FALSE  TRUE FALSE FALSE FALSE

$student4
[1] FALSE  TRUE FALSE FALSE FALSE

We can now use that vector to subset the course vector for every student. Recall that we can use a logical vector to subset vectors. Here, we do so using x$course[grepl(pattern = "Economic sociology", x = x$course)]. Note that the first x in x = x$course refers to grepl()’s argument name, not to the list’s components. Adding all these in lapply():

lapply(allstud, function(x) x$course[grepl(pattern = "Economic sociology", x = x$course)])
$student1
[1] "Economic sociology"

$student2
character(0)

$student3
[1] "Economic sociology"

$student4
[1] "Economic sociology"

The list includes all students and for student2, the component in that list is an empty character vector. In other words, this student doesn’t have this course in the course vector. Without subsetting x$course the function grepl() would show a list with logical indices.

Note that we can use lapply(allstud, function(x) grepl(pattern = "Economic sociology", x = x$course)) to subset other components in every student’s list. For instance, the length of “ects” or “hours” is equal to the length of the components in the logical vector. In other words, we can also extract the ects or hours included in the program of every student who took Economic sociology. Here, both hours and ects are the same as Economic sociology is the same course across students:

lapply(allstud, function(x) x$ects[grepl(pattern = "Economic sociology", x = x$course)])
$student1
[1] 3

$student2
numeric(0)

$student3
[1] 3

$student4
[1] 3

What about components such as “name” or “number”? Their length (1) is different from the length of the subsetting logical vectors. Here we can use the fact that TRUE = 1 and FALSE = 0. We looked for one pattern “Economic sociology”. If this pattern occurs in the “course” vector, lapply(allstud, function(x) grepl(pattern = "Economic sociology", x = x$course)) shows TRUE for that position and FALSE elsewhere. Summing across TRUE and FALSE will result in 1 if the subject is included and 0 if this is not the case:

lapply(allstud, function(x) sum(grepl(pattern = "Economic sociology", x = x$course)))
$student1
[1] 1

$student2
[1] 0

$student3
[1] 1

$student4
[1] 1

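We can now use this result to subset, e.g. the name, and identify who took Economic sociology and who didn’t. An index of 1 returns the (single) name, while an index of 0 returns an empty character vector: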

lapply(allstud, function(x) x$name[sum(grepl(pattern = "Economic sociology", x = x$course))])
$student1
[1] "Alice Wonderland"

$student2
character(0)

$student3
[1] "Clark Kent"

$student4
[1] "Lois Lane"
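
If you only need the names of the students who took the course, a compact variant is sketched below. The list studs is a hypothetical two-student miniature of allstud, and the use of %in% (an exact match instead of a pattern match) and vapply() is a choice made for this sketch:

```r
# Hypothetical miniature of allstud with only the components used here
studs <- list(
  student1 = list(name = "Alice Wonderland",
                  course = c("Strategic management", "Economic sociology")),
  student2 = list(name = "Bart Vader",
                  course = c("Strategic management", "Financial statement analysis"))
)

# One TRUE/FALSE per student: is the course part of the course vector?
took <- vapply(studs, function(x) "Economic sociology" %in% x$course, logical(1))

# Subset the nested list and pull out the names as a character vector
took_names <- vapply(studs[took], function(x) x$name, character(1))
took_names
```

Unlike the lapply() result above, students without the course are dropped instead of showing up as character(0).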

Suppose that you want to know the distribution of the value of a stock market portfolio 30 years from now. You would like to answer questions such as: what is the probability that for every euro you invest today, the (nominal) value of your portfolio will rise to e.g. 10 euros in 30 years’ time, or what is the probability that your portfolio will be worth 5 euros in 30 years’ time? Because you cannot predict the future with certainty, you decide to run a simulation to estimate this distribution. Using the simulation, you will generate “a lot of” 30 year periods. Using these results, you try to answer your questions.

You assume that stock market returns (i.e. the percentage change in the value of your portfolio) are normally distributed. The parameters of this normal distribution - the mean and the standard deviation - equal the average percentage change and the volatility. For instance, if you assume that the yearly mean is 8% and the yearly volatility is 20%, then you know that in any given year, the return will be between -12% and +28% in roughly 68% of all years and between -32% and +48% in roughly 95% of all years. However, you are not sure that the mean will be 8%. Some portfolios have a lower expected return. Usually, they also have a lower volatility. On the other hand, some portfolios have a higher expected return. In that case, their volatility is higher.

You also want to look at returns per month. Doing so gives you 360 monthly returns per simulated 30 year period rather than 30 yearly returns. In other words, you will run simulations using the following combinations of expected return and volatility:

  • yearly: 6% and 12.00% - monthly: 0.48676% and 3.46410%
  • yearly: 7% and 15.75% - monthly: 0.56541% and 4.54663%
  • yearly: 8% and 20.00% - monthly: 0.64340% and 5.77350%
  • yearly: 9% and 24.75% - monthly: 0.72073% and 7.14471%
  • yearly: 10% and 30.00% - monthly: 0.79741% and 8.66025%

You store these values in a matrix, mat_data. This matrix is given:

mat_data <- matrix(c(0.48676, 0.56541, 0.64340, 0.72073, 0.79741, 3.46410, 4.54663, 5.77350, 7.14471, 8.66025)/100, nrow = 5, ncol = 2)
colnames(mat_data) <- c("exp_ret", "vol")
rownames(mat_data) <- paste("sim", 1:5, sep = "_")
mat_data
        exp_ret       vol
sim_1 0.0048676 0.0346410
sim_2 0.0056541 0.0454663
sim_3 0.0064340 0.0577350
sim_4 0.0072073 0.0714471
sim_5 0.0079741 0.0866025
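
The monthly figures appear to be consistent with two common conventions: compounding for the expected return and the square root of time for the volatility. This is an assumption about how the values were obtained, not something stated in the text. A minimal sketch for three of the rows:

```r
# Yearly parameters for three of the scenarios
yearly_ret <- c(0.06, 0.08, 0.10)
yearly_vol <- c(0.12, 0.20, 0.30)

# Assumed conversion: compound the mean, scale the volatility by sqrt(12)
monthly_ret <- (1 + yearly_ret)^(1 / 12) - 1
monthly_vol <- yearly_vol / sqrt(12)

# In percent, rounded to 5 decimals for comparison with the list above
round(100 * cbind(monthly_ret, monthly_vol), 5)
```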

How do you run this simulation? For every monthly return - volatility combination (i.e. for every row in mat_data), you draw 360 random draws from a normal distribution. To see the total value after 360 months, you add 1 to every draw and calculate the cumulative product. To see this, note that every euro invested will be worth

\[ (1 + r_1) \] after one month, \[ (1+ r_1)(1 + r_2) \] after two months, … . In other words,

\[ (1 + r_1)(1 + r_2) ... (1 + r_{360}) \] will be the value after 360 months or 30 years.

Here you draw the r’s from the normal distribution. If you then add 1, every value will equal \(1 + r_t\). The cumulative product will then show your total value after 360 months. Here, you have one simulation run, but you need a “large number” of these runs to answer your questions. So, for every return - volatility combination, you generate this simulation 100 times.
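
A small worked example, with three made-up monthly returns instead of random draws, shows how the cumulative product tracks the value of one euro:

```r
# Three made-up monthly returns: +2%, -1%, +3%
r <- c(0.02, -0.01, 0.03)

# Value of one euro after each month
path <- cumprod(1 + r)
path

# Value at the end of the period only
final <- prod(1 + r)
final
```

The last element of cumprod(1 + r) equals prod(1 + r), which is why prod() is enough when only the final value is needed.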

To store the results, we will use a list for every return - volatility pair and call it simi where i refers to the row in mat_data. We will store the expected return and volatility as simi$exp_return and simi$volatility. Because you may need these results for other time periods as well, you store the returns matrix in simi$sim_data. After the simulation, you collect the 100 final values in a vector and add it as simi$exp_value. Your results allow you to estimate the quantiles of the value distribution. You will store them as simi$quantiles. In addition, you store the mean in simi$mean and the standard deviation in simi$st_dev. The last thing you want to store is the histogram of the final values in simi$plot_sim. For every expected return - volatility combination, you have a separate list. You store these lists in a list simulations.

Let’s create the lists first

  • create an empty list:
simulations <- list()

Let’s look at the simulation for the first return - volatility pair.

  • create a list, sim1 and add expected return sim1$exp_return and volatility sim1$volatility to the list. Recall that these values are stored in the first row in mat_data:
sim1 <- list(exp_return = mat_data[[1, 1]], 
             volatility = mat_data[[1, 2]])
sim1
$exp_return
[1] 0.0048676

$volatility
[1] 0.034641
# Note that there are other ways to do this. For instance, you could have 
# created an empty list `sim1 <- list()` and used `sim1$exp_return <- mat_data[1, 1]`, 
# or started from the empty list and used `sim1 <- c(sim1, "exp_return" = mat_data[1, 1])`. 
  • add this list to simulations with the name sim1
simulations[["sim1"]] <- sim1

# Note that there are other ways to do this, e.g. `simulations$sim1 <- sim1`. 

Let’s automate this for the other lists. Here the code is given. Try to predict what every line in this code does. Note that sim1 was created. In other words, i can start from 2 and needs to run to 5. Focus on the lines that deal with “lists”.

for (i in 2:5) {
  sim_names <- paste0("sim", i)
  temp_list <- list(exp_return = mat_data[[i, 1]],
                    volatility = mat_data[[i, 2]])
  simulations[[sim_names]] <- temp_list
}
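
For comparison, the same list of parameter lists can be built in one pass with lapply(), without creating sim1 separately. This is only a sketch of an alternative: the name sims_alt is made up here, and mat_data is retyped so that the block runs on its own:

```r
mat_data <- matrix(c(0.48676, 0.56541, 0.64340, 0.72073, 0.79741,
                     3.46410, 4.54663, 5.77350, 7.14471, 8.66025) / 100,
                   nrow = 5, ncol = 2,
                   dimnames = list(NULL, c("exp_ret", "vol")))

# One list per row of mat_data
sims_alt <- lapply(seq_len(nrow(mat_data)), function(i) {
  list(exp_return = mat_data[[i, 1]], volatility = mat_data[[i, 2]])
})

# lapply() over an index returns an unnamed list, so the names are set afterwards
names(sims_alt) <- paste0("sim", seq_len(nrow(mat_data)))
str(sims_alt$sim3)
```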
rm(temp_list)

Use the values in sim1$exp_return and sim1$volatility, …, sim5$exp_return and sim5$volatility to generate a 360 x 100 matrix with random draws from a normal distribution with mean and standard deviation given by exp_return and volatility, add 1 to every element and add this matrix to sim1, …, sim5. Do this so that you can rerun the simulations with another set of parameters for the months and draws. In other words, assign the number of draws, ndraws, and the number of months, nmonths, to separate variables. Use these to determine the dimensions of your matrix. Assign this matrix to simi$sim_data.

First let’s look at an example to generate the matrix. Here, call this matrix mat and use the data stored in sim1 to set the mean and standard deviation:

ndraws <- 100
nmonths <- 360

# we need ndraws per month: a total of ndraws * nmonths random draws
# store them in ndraws columns with one row per month

mat <- matrix(rnorm(n = (ndraws * nmonths), 
                    mean = simulations$sim1$exp_return, 
                    sd = simulations$sim1$volatility),
              nrow = nmonths, 
              ncol = ndraws) + 1
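
Because rnorm() produces different draws on every run, your numbers will differ from the ones shown in this section. If you want reproducible results, you can set a seed first. The sketch below uses an arbitrary seed (123) and the sim1 parameters typed in directly; neither the seed nor the names mat_a and mat_b are part of the exercise:

```r
ndraws <- 100
nmonths <- 360

set.seed(123)   # arbitrary seed, chosen only for illustration
mat_a <- matrix(rnorm(ndraws * nmonths, mean = 0.0048676, sd = 0.034641),
                nrow = nmonths, ncol = ndraws) + 1

set.seed(123)   # same seed, so the same draws
mat_b <- matrix(rnorm(ndraws * nmonths, mean = 0.0048676, sd = 0.034641),
                nrow = nmonths, ncol = ndraws) + 1

identical(mat_a, mat_b)
dim(mat_a)
```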

Let’s try to automate this process using the lapply() function and add the matrix sim_data to every list within the simulations list. Recall that you need to wrap the matrix in a list() call. Use function(x) c(x, "name" = list(...)) in the lapply() function to do so:

simulations <- lapply(simulations, function(x) c(x, "sim_data" = list(matrix(rnorm(ndraws * nmonths, x$exp_return, x$volatility), nmonths, ndraws) + 1)))

Let’s see what the alternative would have been if you had used a for loop. Here, the code is given. Try to see what these steps do with respect to lists in this simulation (how are they subsetted …).

# for (i in seq_along(simulations)) {
#   
#   simulations[[i]]$sim_data <- matrix(rnorm(ndraws * nmonths, simulations[[i]]$exp_return, simulations[[i]]$volatility),
#                                       nrow = nmonths,
#                                       ncol = ndraws) + 1
# }
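
A third variant, sketched here on a small made-up list sims, assigns the new component inside the function and returns the modified copy. With this approach no list() wrapper is needed, because `$<-` stores the matrix as a single component:

```r
# Two hypothetical parameter sets standing in for the simulations list
sims <- list(sim1 = list(exp_return = 0.005, volatility = 0.03),
             sim2 = list(exp_return = 0.006, volatility = 0.05))

sims <- lapply(sims, function(x) {
  # 12 months and 4 draws keep the example small
  x$sim_data <- matrix(rnorm(12 * 4, x$exp_return, x$volatility),
                       nrow = 12, ncol = 4) + 1
  x   # return the modified copy of the sublist
})

names(sims$sim1)
dim(sims$sim2$sim_data)
```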

Verify that your results are from the correct normal distribution. To do so, use the sapply() function to create the vectors est_mean and est_volatility as the mean and standard deviation of all elements in the sim_data matrix minus 1 (recall that you added one, so for this purpose you need to subtract 1):

est_mean <- sapply(simulations, function(x) mean(x$sim_data - 1))
est_volatility <- sapply(simulations, function(x) sd(x$sim_data - 1))

You can now use this matrix to determine the value for every one of these 100 draws after 360 months. Assign this vector to simi$exp_value. Recall that you can use the apply() function to calculate the product of all values in a column of a matrix and that you need to simplify the result of apply(). Use the lapply() function to generate these vectors across the various simulations.

simulations <- lapply(simulations, function(x) c(x, "exp_value" = list(apply(x$sim_data, 2, FUN = prod, simplify = TRUE))))

You now have, for every euro invested today, the value 30 years from now for 5 scenarios in terms of the expected return and volatility and for 100 simulation runs per return - volatility combination. Use these values to calculate summary statistics: quantiles (with probabilities 10%, 25%, 50%, 75% and 90%), mean and standard deviation. Store these in simi$quantiles, simi$mean and simi$st_dev. You will need three lines of code using lapply():

simulations <- lapply(simulations, function(x) c(x, "quantiles" = list(quantile(x$exp_value, probs = c(0.10, 0.25, 0.50, 0.75, 0.90), names = TRUE))))
simulations <- lapply(simulations, function(x) c(x, "mean" = mean(x$exp_value, na.rm = TRUE)))
simulations <- lapply(simulations, function(x) c(x, "st_dev" = sd(x$exp_value, na.rm = TRUE)))

Now you can generate a plot. Here is the code to generate the plot for sim1. Try to read it to see what the code is doing. Use ?hist or ?plot to see what these lines are doing:

plot_sim <- hist(simulations$sim1$exp_value, probability = TRUE)

plot(plot_sim, col = "lightyellow", border = "lightgrey", 
     xlab = "Expected value", 
     main = glue::glue("Simulation with expected return {simulations$sim1$exp_return} and volatility {simulations$sim1$volatility}"))

Now, generate this plot and store it in $plot_sim in every list in simulations. You can use lapply() to do so. You can leave the plot() code out and only use the part in hist() from the previous code.

simulations <- lapply(simulations, function(x) c(x, "plot_sim" = list(hist(x$exp_value, probability = TRUE))))

You now have all your data for your simulations. Now, let’s take a closer look at some of the results and answer a couple of questions. Store each answer in a vector or list as indicated in the question.

  • What is the mean value in each of the 5 simulations? Store this result in a vector, mean_sim:
mean_sim <- sapply(simulations, function(x) x$mean)
mean_sim
     sim1      sim2      sim3      sim4      sim5 
 5.984450  8.213977 11.126300 12.941953 12.151095 
  • What are the 10th and 90th percentiles in each of these simulations? Store these results in two vectors, low_value and high_value:
low_value <- sapply(simulations, function(x) x$quantiles[1])
high_value <- sapply(simulations, function(x) x$quantiles[5])
low_value
 sim1.10%  sim2.10%  sim3.10%  sim4.10%  sim5.10% 
1.9962861 1.6127327 1.5882490 1.2373738 0.6525462 
high_value
sim1.90% sim2.90% sim3.90% sim4.90% sim5.90% 
11.08629 18.71673 24.83965 33.50628 30.25813 
  • Calculate the standard deviation for each simulation’s expected value. Store this result in a vector vol_sim:
vol_sim <- sapply(simulations, function(x) x$st_dev)
vol_sim
     sim1      sim2      sim3      sim4      sim5 
 4.757061  9.713904 14.209238 18.484862 19.995729 
  • Which simulation run, in each of the simulations, gave the highest value? Store this result in a list max_run. Do the same for the lowest value and store it in min_run:
max_run <- lapply(simulations, function(x) which.max(x$exp_value))
min_run <- lapply(simulations, function(x) which.min(x$exp_value))
  • Select the sim5$sim_data column associated with the highest expected value in sim5. Store it in a vector test:
test <- simulations$sim5$sim_data[, max_run$sim5]
  • Test if the product of test is equal to the maximum of the expected values for sim5. Taking the absolute value of the difference makes sure that a large negative difference does not pass the test:
abs(prod(test) - simulations$sim5$exp_value[[max_run$sim5]]) < 10^(-12)
[1] TRUE
  • Given the mean, you can calculate the expected value if there weren’t any volatility as

\[ (1 + r)^{360} \]

  • Calculate for every simulation how many runs are below this level. Store the result in a list below_ave:
below_ave <- lapply(simulations, function(x) sum(x$exp_value < (1 + x$exp_return)^(nmonths)))
below_ave
$sim1
[1] 64

$sim2
[1] 66

$sim3
[1] 64

$sim4
[1] 74

$sim5
[1] 79
  • Is the mean expected value less than the expected value without volatility? Store this in a list with logical values diff_mean_bool:
diff_mean_bool <- lapply(simulations, function(x) (x$mean - (1 + x$exp_return)^(nmonths)) < 0)
diff_mean_bool
$sim1
[1] FALSE

$sim2
[1] FALSE

$sim3
[1] FALSE

$sim4
[1] TRUE

$sim5
[1] TRUE
  • How large is the difference between the actual mean and the mean without volatility? Store this in a list diff_mean:
diff_mean <- lapply(simulations, function(x) (x$mean - (1 + x$exp_return)^(nmonths)))
diff_mean
$sim1
[1] 0.2408568

$sim2
[1] 0.6018453

$sim3
[1] 1.063751

$sim4
[1] -0.3256147

$sim5
[1] -5.298055
  • Use the layout for the last histogram to plot the histogram for every simulation
lapply(simulations, function(x) 
  plot(x$plot_sim, col = "lightyellow", border = "lightgrey", 
     xlab = "Expected value", 
     main = glue::glue("Simulation with expected return {x$exp_return} and volatility {x$volatility}")))

$sim1
NULL

$sim2
NULL

$sim3
NULL

$sim4
NULL

$sim5
NULL

You can verify your plots in the Plots tab. The arrow to the left should allow you to browse through the 5 plots, each with a different title.
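
One way to read the comparison with \( (1 + r)^{360} \) in the questions above is the following. Under the simulation's own assumption that the monthly returns are independent draws with the same mean \( \mu \), the expected value of the product equals the product of the expected values:

```latex
\[
E\left[\prod_{t=1}^{360} (1 + r_t)\right]
  = \prod_{t=1}^{360} E\left[1 + r_t\right]
  = (1 + \mu)^{360}
\]
```

In other words, the value without volatility is also the theoretical mean of the simulated final values. The sample mean over 100 runs estimates it with noise, which would explain why the sign of diff_mean differs across simulations. The distribution of a product of many factors is typically skewed to the right, so more than half of the runs can end up below this mean, in line with the counts in below_ave.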


4.5 Data frames and tibbles

You can think about data frames as lists where each column has the same length (as in a matrix) but each column can store a different type of data (as in a list). As in a matrix, a data frame usually has a fixed set of rows and columns but as in a list, these columns can store different types of variables. We will also use a special type of data frame: a tibble. Tibbles are essentially data frames, but with some additional characteristics.

4.5.1 Creating a data frame

4.5.1.1 The basics

To create a data frame, you can use the data.frame() function. The first arguments are the data for the data frame. In addition, you can set row.names; its default is NULL, in which case R doesn’t add row names other than 1, 2, 3, …. Adding a vector (integer or character) with the row names or specifying which column R needs to use for row names changes that default. Two other arguments check the data: check.rows (default FALSE), when set to TRUE, checks if the rows are consistent in terms of their length and in terms of their names; check.names (default TRUE) checks the names of the variables to see if these are valid variable names and not duplicates. The last two arguments, fix.empty.names = TRUE and stringsAsFactors = FALSE, control whether R adds an automatically generated name in case a variable name is empty and whether R changes character variables into factors. Let’s create a data frame, df, whose values include numbers, logical values, characters and dates:

df <- data.frame(numbers = c(1, 2, 3, 4, 5),  bools = c(T, F, F, T, T), characters = letters[1:5], dates = seq.Date(as.Date("2025-03-25"), length.out = 5, by = "day"))
df
  numbers bools characters      dates
1       1  TRUE          a 2025-03-25
2       2 FALSE          b 2025-03-26
3       3 FALSE          c 2025-03-27
4       4  TRUE          d 2025-03-28
5       5  TRUE          e 2025-03-29

You can verify that this is a data frame using e.g.

is.data.frame(df)
[1] TRUE

or from the class

class(df)
[1] "data.frame"

Note that a data frame is also a list:

is.list(df)
[1] TRUE
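
Because a data frame is a list of columns, the list tools from the previous section carry over. A minimal sketch on a smaller, made-up data frame (assuming R 4.0 or later, where characters are not converted to factors by default):

```r
df_small <- data.frame(numbers = c(1, 2, 3),
                       bools = c(TRUE, FALSE, TRUE),
                       characters = letters[1:3])

# sapply() visits every column, just like every component of a list
col_classes <- sapply(df_small, class)
col_classes

# List-style extraction returns the column as a vector
df_small[["numbers"]]
df_small$characters
```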

Checking the structure of df

str(df)
'data.frame':   5 obs. of  4 variables:
 $ numbers   : num  1 2 3 4 5
 $ bools     : logi  TRUE FALSE FALSE TRUE TRUE
 $ characters: chr  "a" "b" "c" "d" ...
 $ dates     : Date, format: "2025-03-25" "2025-03-26" ...

you can see that this structure shows similarities with a named list. From the structure, you can also see that this data frame includes 5 observations for 4 variables. The structure also shows the type of each variable. The length() or ncol() show the number of variables, while nrow() shows the number of observations:

length(df)
[1] 4
ncol(df)
[1] 4
nrow(df)
[1] 5

Recall that ncol() and nrow() allowed you to determine the dimensions of a matrix. To access the names of the variables, you can use

names(df)
[1] "numbers"    "bools"      "characters" "dates"     

R returns a character vector with the names of the variables. If the data includes row names, you can see them using

row.names(df)
[1] "1" "2" "3" "4" "5"

A data frame’s columns must have the same length (nrows) and R will sometimes force this to happen. To see this, let’s change a couple of arguments in df <- data.frame():

  • numbers is a numeric value, not a vector of 5 values:
df1 <- data.frame(numbers = 10,  bools = c(T, F, F, T, T), characters = letters[1:5], dates = seq.Date(as.Date("2025-03-25"), length.out = 5, by = "day"))
df1
  numbers bools characters      dates
1      10  TRUE          a 2025-03-25
2      10 FALSE          b 2025-03-26
3      10 FALSE          c 2025-03-27
4      10  TRUE          d 2025-03-28
5      10  TRUE          e 2025-03-29

R copies the value “10” and fills the column “numbers” until the number of values equals the number of rows in the data frame. This is called recycling. R recycles single numeric values to fill a column.

  • the boolean vector includes only 3 values, not 5:
df2 <- data.frame(numbers = c(1, 2, 3, 4, 5),  bools = c(T, F, F), characters = letters[1:5], dates = seq.Date(as.Date("2025-03-25"), length.out = 5, by = "day"))
Error in data.frame(numbers = c(1, 2, 3, 4, 5), bools = c(T, F, F), characters = letters[1:5], : arguments imply differing number of rows: 5, 3
df2
Error: object 'df2' not found

Here, R produces an error. R only recycles a vector if the number of rows is a whole multiple of the vector’s length. With 3 values and 5 rows this is not the case, so R cannot fill the bools column and does not create the data frame.

  • the character vector includes more than 5 values:
df3 <- data.frame(numbers = c(1, 2, 3, 4, 5),  bools = c(T, F, F, T, T), characters = letters[1:8], dates = seq.Date(as.Date("2025-03-25"), length.out = 5, by = "day"))
Error in data.frame(numbers = c(1, 2, 3, 4, 5), bools = c(T, F, F, T, : arguments imply differing number of rows: 5, 8
df3
Error: object 'df3' not found

Here too, R will not execute this command. In this case, R doesn’t know which values to drop from the character vector.

However, with one value, R recycles the character:

df4 <- data.frame(numbers = c(1, 2, 3, 4, 5),  bools = c(T, F, F, T, T), characters = letters[1], dates = seq.Date(as.Date("2025-03-25"), length.out = 5, by = "day"))
df4
  numbers bools characters      dates
1       1  TRUE          a 2025-03-25
2       2 FALSE          a 2025-03-26
3       3 FALSE          a 2025-03-27
4       4  TRUE          a 2025-03-28
5       5  TRUE          a 2025-03-29
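
Recycling is not limited to single values: data.frame() also recycles a shorter vector when the number of rows is a whole multiple of its length. A minimal sketch with made-up columns:

```r
# 6 rows: the length-2 and length-3 vectors divide 6 evenly, so both are recycled
df_rec <- data.frame(id = 1:6, flag = c(TRUE, FALSE), grp = c("a", "b", "c"))
df_rec

# 5 rows and a length-3 vector: not a whole multiple, so data.frame() stops
res <- try(data.frame(id = 1:5, flag = c(TRUE, FALSE, FALSE)), silent = TRUE)
inherits(res, "try-error")
```

Silent recycling can hide mistakes in the length of a vector, so it is worth checking nrow() after creating a data frame this way.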

To see what the other arguments in the data.frame() function do, let’s add them and see how they change the output.

  • specifying row.names = 3L uses the third column of the data as row names:
df <- data.frame(numbers = c(1, 2, 3, 4, 5),  bools = c(T, F, F, T, T), characters = letters[1:5], dates = seq.Date(as.Date("2025-03-25"), length.out = 5, by = "day"), 
                 row.names = 3L)
df
  numbers bools      dates
a       1  TRUE 2025-03-25
b       2 FALSE 2025-03-26
c       3 FALSE 2025-03-27
d       4  TRUE 2025-03-28
e       5  TRUE 2025-03-29
  • As an alternative, you can add a vector with names: c("Obs.A", "Obs.B", "Obs.C", "Obs.D", "Obs.E"):
df <- data.frame(numbers = c(1, 2, 3, 4, 5),  bools = c(T, F, F, T, T), characters = letters[1:5], dates = seq.Date(as.Date("2025-03-25"), length.out = 5, by = "day"), 
                 row.names = c("Obs.A", "Obs.B", "Obs.C", "Obs.D", "Obs.E"))
df
      numbers bools characters      dates
Obs.A       1  TRUE          a 2025-03-25
Obs.B       2 FALSE          b 2025-03-26
Obs.C       3 FALSE          c 2025-03-27
Obs.D       4  TRUE          d 2025-03-28
Obs.E       5  TRUE          e 2025-03-29
  • let’s remove the name bools and see what the function returns:
df <- data.frame(numbers = c(1, 2, 3, 4, 5), c(T, F, F, T, T), characters = letters[1:5], dates = seq.Date(as.Date("2025-03-25"), length.out = 5, by = "day"))
df
  numbers c.T..F..F..T..T. characters      dates
1       1             TRUE          a 2025-03-25
2       2            FALSE          b 2025-03-26
3       3            FALSE          c 2025-03-27
4       4             TRUE          d 2025-03-28
5       5             TRUE          e 2025-03-29
str(df)
'data.frame':   5 obs. of  4 variables:
 $ numbers         : num  1 2 3 4 5
 $ c.T..F..F..T..T.: logi  TRUE FALSE FALSE TRUE TRUE
 $ characters      : chr  "a" "b" "c" "d" ...
 $ dates           : Date, format: "2025-03-25" "2025-03-26" ...

Here, R creates the name of the logical variable from the vector c(T, F, F, T, T). It does so by removing the brackets and replacing commas and spaces with dots. If you include the check.names = FALSE argument, R will use c(T, F, F, T, T) as a name. If you want to avoid this, you need to use fix.empty.names = FALSE.

df <- data.frame(numbers = c(1, 2, 3, 4, 5), c(T, F, F, T, T), characters = letters[1:5], dates = seq.Date(as.Date("2025-03-25"), length.out = 5, by = "day"), 
                 fix.empty.names = FALSE)
df
  numbers       characters      dates
1       1  TRUE          a 2025-03-25
2       2 FALSE          b 2025-03-26
3       3 FALSE          c 2025-03-27
4       4  TRUE          d 2025-03-28
5       5  TRUE          e 2025-03-29
str(df)
'data.frame':   5 obs. of  4 variables:
 $ numbers   : num  1 2 3 4 5
 $           : logi  TRUE FALSE FALSE TRUE TRUE
 $ characters: chr  "a" "b" "c" "d" ...
 $ dates     : Date, format: "2025-03-25" "2025-03-26" ...

Here, R leaves the name of the variable empty. You can now set your own name. The last argument, stringsAsFactors = FALSE, keeps characters as characters. Changing this into TRUE converts these characters into factors.
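
The effect of stringsAsFactors and check.names can be seen in a short sketch. The data frames df_chr, df_fct, df_raw and df_fix are made up for this illustration:

```r
# stringsAsFactors decides whether character columns become factors
df_chr <- data.frame(x = c("a", "b", "a"), stringsAsFactors = FALSE)
df_fct <- data.frame(x = c("a", "b", "a"), stringsAsFactors = TRUE)
class(df_chr$x)
class(df_fct$x)
levels(df_fct$x)

# check.names = FALSE keeps a non-syntactic name exactly as typed
df_raw <- data.frame(`var 1` = 1:2, check.names = FALSE)
names(df_raw)

# The default (check.names = TRUE) repairs the name
df_fix <- data.frame(`var 1` = 1:2)
names(df_fix)
```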

4.5.1.2 Tibbles and data frames

Tibbles are essentially data frames but come with a couple of special features. To use tibbles, you need to load the tibble package, which is included in the tidyverse suite of packages. A tibble and a data frame also differ in how they handle, e.g. printing or subsetting. First, if you print a tibble, it will highlight some special features and will only show the first 10 observations. Data frames show all observations. For long datasets, you need to add a command telling R to show only e.g. 10 lines. Second, tibbles are more strict in terms of subsetting compared to data frames. As we’ll see, subsetting a tibble with [ always returns a tibble, while a data frame can return a vector. Last, tibbles allow for non-syntactic column names, e.g. var 1.
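
For the data frame side of the subsetting difference, a minimal base R sketch is given here. The data frame df_sub is made up, and drop = FALSE is the base R way to keep a data frame when selecting a single column:

```r
df_sub <- data.frame(numbers = c(1, 2, 3), characters = letters[1:3])

one_col <- df_sub[, "numbers"]                 # simplified to a numeric vector
kept_df <- df_sub[, "numbers", drop = FALSE]   # stays a data frame

is.data.frame(one_col)
is.data.frame(kept_df)
```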

With respect to creating a tibble, the basics are very similar to those for data frames.

To illustrate, let’s create a tibble:

df_tib <- tibble::tibble(numbers = c(1, 2, 3, 4, 5),  bools = c(T, F, F, T, T), characters = letters[1:5], dates = seq.Date(as.Date("2025-03-25"), length.out = 5, by = "day"))
df_tib
# A tibble: 5 × 4
  numbers bools characters dates     
    <dbl> <lgl> <chr>      <date>    
1       1 TRUE  a          2025-03-25
2       2 FALSE b          2025-03-26
3       3 FALSE c          2025-03-27
4       4 TRUE  d          2025-03-28
5       5 TRUE  e          2025-03-29

and compare the result with the data frame:

df <- data.frame(numbers = c(1, 2, 3, 4, 5),  bools = c(T, F, F, T, T), characters = letters[1:5], dates = seq.Date(as.Date("2025-03-25"), length.out = 5, by = "day"))
df
  numbers bools characters      dates
1       1  TRUE          a 2025-03-25
2       2 FALSE          b 2025-03-26
3       3 FALSE          c 2025-03-27
4       4  TRUE          d 2025-03-28
5       5  TRUE          e 2025-03-29

The first thing to note is that the printed result shows the number of rows and columns for a tibble, but not for a data frame. In addition, the tibble also shows the type of the data stored in each column, while the data frame doesn’t show this output. Row names in the tibble are shown in grey, indicating that they were automatically generated. You can verify the class of a tibble:

class(df_tib)
[1] "tbl_df"     "tbl"        "data.frame"

Here, you can see that a tibble is also a data frame. The tibble() function includes 2 arguments in addition to the data part: .rows = and .name_repair = c("check_unique", "unique", "universal", "minimal"). The former allows you to set the number of rows. You could add this as a check to see if the number of observations in your dataset matches your expectations or to create an empty tibble using .rows = 0. The latter argument allows you to tell R how to treat problematic column names. The default value, check_unique, verifies that every column has a unique name but doesn’t try to repair the names; universal makes names unique and brings them in line with the R syntax; unique makes sure that names exist and that they are unique, while minimal does not repair or check names other than verifying that a name exists.
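
The name repair options can be compared in a short sketch. It assumes that the tibble package is installed; the object names tb_min, tb_unq, tb_uni and tb_empty are made up for this illustration:

```r
# "minimal": duplicate names are accepted as they are
tb_min <- tibble::tibble(x = 1, x = 2, .name_repair = "minimal")
names(tb_min)

# "unique": duplicates are made unique by adding a position suffix
tb_unq <- tibble::tibble(x = 1, x = 2, .name_repair = "unique")
names(tb_unq)

# "universal": names are also made syntactic
tb_uni <- tibble::tibble(`var 1` = 1, .name_repair = "universal")
names(tb_uni)

# .rows = 0 creates an empty tibble
tb_empty <- tibble::tibble(.rows = 0)
dim(tb_empty)
```

With the default check_unique, the first two calls would stop with an error because of the duplicated name x.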

4.5.1.3 Showing parts of a data frame or tibble

Using head(df, n = ) or tail(df, n = ) you can print the first (head) or last (tail) n lines of a data frame or tibble. Suppose you want to see the first 2 lines of df you would use:

head(df, n = 2)
  numbers bools characters      dates
1       1  TRUE          a 2025-03-25
2       2 FALSE          b 2025-03-26

To see the last 3 lines of df_tib:

tail(df_tib, n = 3)
# A tibble: 3 × 4
  numbers bools characters dates     
    <dbl> <lgl> <chr>      <date>    
1       3 FALSE c          2025-03-27
2       4 TRUE  d          2025-03-28
3       5 TRUE  e          2025-03-29
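
As a small additional sketch, head() and tail() also accept a negative n, in which case they return everything except the last (head) or first (tail) n lines. The data frame d below is a hypothetical example and is not part of the datasets used in this chapter.

```r
# Hypothetical data frame with 5 rows, used only for this illustration
d <- data.frame(id = 1:5, value = c(10, 20, 30, 40, 50))

# All rows except the last 2 (rows 1 to 3)
first_part <- head(d, n = -2)
first_part

# All rows except the first 2 (rows 3 to 5)
last_part <- tail(d, n = -2)
last_part
```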

4.5.1.4 Coercing objects to data frame

Using as.data.frame() you can change another object into a data frame. Here, the arguments are largely the same as those for data.frame(), with the exception that now you need to include the object you want to change into a data frame. For instance, let’s create a 3x5 matrix and add names:

mat <- matrix(round(runif(15), 2), 3, 5)
colnames(mat) <- paste("var", 1:5, sep = "_")
rownames(mat) <- paste("obs", 1:3, sep = "_")
mat
      var_1 var_2 var_3 var_4 var_5
obs_1  0.52  0.71  0.46  0.93  0.25
obs_2  0.34  0.32  0.76  0.12  0.37
obs_3  0.94  0.46  0.39  0.81  0.84

Changing this matrix into a data frame, using as.data.frame():

mat_df <- as.data.frame(mat)
mat_df
      var_1 var_2 var_3 var_4 var_5
obs_1  0.52  0.71  0.46  0.93  0.25
obs_2  0.34  0.32  0.76  0.12  0.37
obs_3  0.94  0.46  0.39  0.81  0.84
str(mat_df)
'data.frame':   3 obs. of  5 variables:
 $ var_1: num  0.52 0.34 0.94
 $ var_2: num  0.71 0.32 0.46
 $ var_3: num  0.46 0.76 0.39
 $ var_4: num  0.93 0.12 0.81
 $ var_5: num  0.25 0.37 0.84

Note that R used the row and column names of the matrix to add row and column names to the data frame. You can use your own row names if you add them via row.names = c() to the as.data.frame() function. Note that you can also change a data frame (with only numeric variables) into a matrix. This allows you to use matrix operators (matrix algebra). Often this is much faster than writing code to perform the same calculations on a data frame. Using as.data.frame() you can then change your matrix back into a data frame.
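
As a minimal sketch of this round trip, the example below changes a small numeric data frame into a matrix, computes a cross-product with the matrix operators t() and %*%, and changes the result back into a data frame. The names num_df, m, xtx and xtx_df are hypothetical and only serve as an illustration.

```r
# Hypothetical numeric data frame used only for this illustration
num_df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))

# Change the data frame into a matrix to be able to use matrix algebra
m <- as.matrix(num_df)

# Cross-product t(m) %*% m: a 2x2 matrix
xtx <- t(m) %*% m

# Change the result back into a data frame
xtx_df <- as.data.frame(xtx)
xtx_df
```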

You can also change other objects into a data frame. For instance, here is a list:

list1 <- list(
  company = c("Firm A", "Firm B", "Firm C", "Firm D", "Firm E"),
  sales = runif(5, min = 100000, max = 1000000),
  margin = runif(5, min = 0.20, max = 0.36),
  region = as.factor(c(1, 1, 2, 2, 2)))

Using as.data.frame():

list1_df <- as.data.frame(list1)
list1_df
  company    sales    margin region
1  Firm A 733606.0 0.2465040      1
2  Firm B 554201.6 0.3128367      1
3  Firm C 860078.6 0.2024773      2
4  Firm D 789595.5 0.2356981      2
5  Firm E 608390.8 0.2843190      2

This changes the list into a data frame. What happens with nested lists? To see this, let’s generate a second list:

list2 <- list(
  company = c("Firm F", "Firm G", "Firm H", "Firm I", "Firm J"),
  sales = runif(5, min = 1000, max = 10000),
  margin = runif(5, min = 0.10, max = 0.16),
  region = as.factor(c(1, 1, 3, 3, 3)))

and create a nested list list_nest using list1 and list2:

list_nest <- list(list1, list2)

Changing list_nest into a data frame:

list_nest_df <- as.data.frame(list_nest)
list_nest_df
  company    sales    margin region company.1  sales.1  margin.1 region.1
1  Firm A 733606.0 0.2465040      1    Firm F 9041.202 0.1170825        1
2  Firm B 554201.6 0.3128367      1    Firm G 3581.459 0.1576101        1
3  Firm C 860078.6 0.2024773      2    Firm H 9732.313 0.1462119        3
4  Firm D 789595.5 0.2356981      2    Firm I 3881.885 0.1210791        3
5  Firm E 608390.8 0.2843190      2    Firm J 2148.158 0.1326172        3

creates a data frame of 8 variables and 5 observations, not a data frame with 4 variables and 10 observations. In other words, here, you’ll need to change the lists on the second level into data frames first e.g. using

list_nest <- lapply(list_nest, function(x) as.data.frame(x))

and then use list_nest to extract the data frames. If all data frames in the nested list include the same variables, you can use rbind() to add them into one data frame. We will discuss rbind() for data frames more in depth in the next section. However, recall that you have used this function to add rows for matrices.
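
The steps above can be sketched as follows. The example uses two small hypothetical lists (l1 and l2) instead of list1 and list2, and it uses do.call() to pass all data frames in the nested list to rbind() at once. The use of do.call() is an assumption on our side: it is one of several ways to stack the data frames.

```r
# Two hypothetical lists with the same variables
l1 <- list(company = c("Firm A", "Firm B"), sales = c(100, 200))
l2 <- list(company = c("Firm C", "Firm D"), sales = c(300, 400))
nested <- list(l1, l2)

# Step 1: change every list on the second level into a data frame
dfs <- lapply(nested, as.data.frame)

# Step 2: stack the data frames on top of each other
stacked <- do.call(rbind, dfs)
stacked
```

The result has 2 variables and 4 observations, in contrast to the wide data frame you get when you apply as.data.frame() directly to the nested list.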

Using as_tibble() you need to specify the object that will be changed into a tibble. In addition, you can add the .name_repair = c("check_unique", "unique", "universal", "minimal") argument to repair names. Note that as.tibble() (with a dot) also exists. This function has been replaced by as_tibble(). For instance, to coerce a matrix into a tibble:

mat_tib <- tibble::as_tibble(mat)
mat_tib
# A tibble: 3 × 5
  var_1 var_2 var_3 var_4 var_5
  <dbl> <dbl> <dbl> <dbl> <dbl>
1  0.52  0.71  0.46  0.93  0.25
2  0.34  0.32  0.76  0.12  0.37
3  0.94  0.46  0.39  0.81  0.84

Note that as_tibble() doesn’t include the row names. To keep them, you need to add a variable where R can store the row names in the tibble. To do so, you add the argument rownames = "name" in the as_tibble() function. Doing so, the function will add the row names from mat as a separate variable to the tibble. The name of this variable is name. For instance, adding the row names of mat to a variable rows in mat_tib:

mat_tib <- tibble::as_tibble(mat, rownames = "rows")
mat_tib
# A tibble: 3 × 6
  rows  var_1 var_2 var_3 var_4 var_5
  <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 obs_1  0.52  0.71  0.46  0.93  0.25
2 obs_2  0.34  0.32  0.76  0.12  0.37
3 obs_3  0.94  0.46  0.39  0.81  0.84

You can also change a data frame into a tibble:

df_tib <- tibble::as_tibble(mat_df)
df_tib
# A tibble: 3 × 5
  var_1 var_2 var_3 var_4 var_5
  <dbl> <dbl> <dbl> <dbl> <dbl>
1  0.52  0.71  0.46  0.93  0.25
2  0.34  0.32  0.76  0.12  0.37
3  0.94  0.46  0.39  0.81  0.84

If your data frame has row names and you would like to keep them, you need to add rownames = "name" in the as_tibble() function:

df_tib <- tibble::as_tibble(mat_df, rownames = "abcdef")
df_tib
# A tibble: 3 × 6
  abcdef var_1 var_2 var_3 var_4 var_5
  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl>
1 obs_1   0.52  0.71  0.46  0.93  0.25
2 obs_2   0.34  0.32  0.76  0.12  0.37
3 obs_3   0.94  0.46  0.39  0.81  0.84

4.5.1.5 Functions returning a data frame

Many functions used to import data in R return a data frame. For instance, read.csv() will import tabular data and return a data frame. The same holds for many other packages that allow you to import data. We will cover examples in Chapter 6.

4.5.2 Subsetting

4.5.2.1 Subsetting a data frame or tibble: variables (columns)

Recall that a data frame borrows characteristics from both a list and a matrix. In other words, you can use both the column number as well as its name to subset a column. Recall that there are 3 subsetting operators: [], [[]] and $. Let’s use each of them with a data frame.

  • [] with column indices or column names:
df[1]
  numbers
1       1
2       2
3       3
4       4
5       5
df["numbers"]
  numbers
1       1
2       2
3       3
4       4
5       5
class(df["numbers"])
[1] "data.frame"

As you can see, here too, as with lists, the [] operator is a preserving operator: it preserves the characteristics of the data frame. Note also that the syntax resembles the list syntax: you only include the column you want to extract and you don’t use e.g. [, 2] or [, "numbers"]. In other words, you don’t include an index for the rows. With data frames, R assumes you need all rows of the column. If you apply the operator to a tibble, the tibble structure will be preserved as well. Note the difference with subsetting a column of a matrix. There, we used mat[, i] to extract the ith column. Here, there is no reference to a row. You could use df[, 1] to subset the first column. However, in that case, R will return an unnamed vector if the subsetting is applied to a data frame. In other words, treating a data frame as a matrix causes R to simplify the output if possible. Doing the same with a tibble will not cause a simplified result: applied to a tibble, [, 1] will return a tibble.

df[, 1]
[1] 1 2 3 4 5
is.vector(df[, 1])
[1] TRUE
  • [[]] with column indices or column names
df[[1]]
[1] 1 2 3 4 5
df[["numbers"]]
[1] 1 2 3 4 5
class(df[["numbers"]])
[1] "numeric"
is.vector(df[["numbers"]])
[1] TRUE

This operator returns a simplified result. Here, the first column is no longer a data frame but a vector. In other words, [[]] acts, as was the case with lists, as the simplifying operator.

  • $ with column names
df$numbers
[1] 1 2 3 4 5
class(df$numbers)
[1] "numeric"
is.vector(df$numbers)
[1] TRUE

As was the case with lists, the $ operator with a data frame returns a simplified result. In other words, df$numbers is equivalent to df[["numbers"]]. The $ operator is the most widely used operator to subset columns in data frames or tibbles. However, there is one difference between data frames and tibbles. Data frames allow for partial matching while tibbles don’t. For instance, with a data frame:

df$numb
[1] 1 2 3 4 5

will work even if there is no variable numb. Doing so with a tibble wouldn’t work:

df_tbl <- tibble::as_tibble(df)
df_tbl$numb
Warning: Unknown or uninitialised column: `numb`.
NULL

As you can see, R didn’t extract the values and gave a warning message.
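
To compare the three operators side by side, here is a minimal sketch on a hypothetical data frame df_demo (not the df used above). It also shows two details of base R that were not covered above: adding drop = FALSE keeps the data frame structure when you use the matrix syntax, and [[]] requires an exact name by default, while $ allows partial matching.

```r
# Hypothetical data frame used only for this illustration
df_demo <- data.frame(numbers = 1:3, characters = c("a", "b", "c"))

# [] preserves the data frame, [[]] and $ simplify to a vector
class(df_demo["numbers"])    # "data.frame"
class(df_demo[["numbers"]])  # "integer"
class(df_demo$numbers)       # "integer"

# The matrix syntax simplifies unless you add drop = FALSE
class(df_demo[, 1])                # "integer"
class(df_demo[, 1, drop = FALSE])  # "data.frame"

# Partial matching: $ finds the column, [[]] returns NULL
partial_dollar <- df_demo$numb
partial_double <- df_demo[["numb"]]
```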

With respect to multiple columns or negative index positions, data frames and tibbles are comparable to lists, vectors or matrices: a negative index position extracts all but the column with the negative index:

df[-4]
  numbers bools characters
1       1  TRUE          a
2       2 FALSE          b
3       3 FALSE          c
4       4  TRUE          d
5       5  TRUE          e

and selecting two or more columns is similar to lists or matrices, e.g.:

df[c(1, 4)]
  numbers      dates
1       1 2025-03-25
2       2 2025-03-26
3       3 2025-03-27
4       4 2025-03-28
5       5 2025-03-29

You can extract a column using the pipe operator. Using base R’s pipe:

df |> _$numbers
[1] 1 2 3 4 5

returns df$numbers. This also holds for tibbles. Note that if you use the magrittr pipe, you need to replace the _ with a dot (.).

4.5.2.2 Subsetting individual elements of a data frame or tibble

There are three ways to subset an individual value. They all return the same output:

df[2, 3]
[1] "b"
df[[2, 3]]
[1] "b"
df$characters[2]
[1] "b"

Negative indices extract all but that value, e.g.

df[-2, 3]
[1] "a" "c" "d" "e"

extracts all but the second row of the third column of df.

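With a tibble, [[2, 3]] and $characters[2] behave in the same way as with a data frame. However, as the [] operator always preserves the tibble structure, [2, 3] and [-2, 3] applied to a tibble return a tibble and not a vector.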

4.5.2.3 Subsetting using a logical index

Data frames show a lot of similarities with other data structures in terms of how you can use logical vectors to subset columns or rows. For instance, extracting the dates on the condition that the value in the column numbers is larger than 2:

cond <- df$numbers > 2
df$dates[cond]
[1] "2025-03-27" "2025-03-28" "2025-03-29"

or selecting multiple columns conditional upon numbers being larger than 2:

df[cond, 1:3]
  numbers bools characters
3       3 FALSE          c
4       4  TRUE          d
5       5  TRUE          e

As you could with the other data structures you can also extract columns using e.g. grepl(). Extracting variables whose name includes “numbers” or “dates” for instance, can be done using:

df[grepl(pattern = "numbers|dates", colnames(df))]
  numbers      dates
1       1 2025-03-25
2       2 2025-03-26
3       3 2025-03-27
4       4 2025-03-28
5       5 2025-03-29

As an alternative, the subset(x, subset, select, drop = FALSE, ...) function allows you to select the variables listed in select from a data frame x for the rows that meet the condition in subset. For instance, selecting the columns “numbers”, “bools” and “characters” for the rows where “numbers” is larger than 2:

subset(df, df$numbers > 2, c("numbers", "bools", "characters"))
  numbers bools characters
3       3 FALSE          c
4       4  TRUE          d
5       5  TRUE          e

Recall that you extracted these values also using df[df$numbers > 2, 1:3].
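
As a minimal sketch, subset() also evaluates its subset and select arguments inside the data frame, which means that you can refer to the columns without the df$ prefix and without quotes. The data frame df_demo below is hypothetical and only mimics the structure of df.

```r
# Hypothetical data frame that mimics the structure of df
df_demo <- data.frame(
  numbers = 1:5,
  bools = c(TRUE, FALSE, FALSE, TRUE, TRUE),
  characters = c("a", "b", "c", "d", "e")
)

# Column names are used directly, without df_demo$ or quotes
by_subset <- subset(df_demo, numbers > 2, select = c(numbers, bools))

# The same selection using [] and a logical index
by_index <- df_demo[df_demo$numbers > 2, c("numbers", "bools")]

# Both approaches return the same rows and columns
all.equal(by_subset, by_index)
```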

In subsequent chapters, we’ll use {dplyr}’s filter() and select() functions to select observations (filter()) and variables (select()).

4.5.3 Changing a data frame/tibble

4.5.3.1 Changing individual elements

Changing individual elements of a data frame is straightforward: you reassign their value as you did for vectors or matrices.
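
A minimal sketch of such reassignments, using a hypothetical data frame df_demo: you subset the element (or elements) you want to change on the left hand side of the assignment operator and put the new value on the right hand side.

```r
# Hypothetical data frame used only for this illustration
df_demo <- data.frame(numbers = c(1, 2, 30), characters = c("a", "b", "c"))

# Change one element using row and column positions or names
df_demo[2, "characters"] <- "z"

# Change one element of a column extracted with $
df_demo$numbers[1] <- 10

# Change all elements that meet a logical condition
df_demo[df_demo$numbers > 20, "characters"] <- "big"
df_demo
```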

4.5.3.2 Adding rows or columns

With respect to data frames, you can use cbind() and rbind() to add columns and rows to a data frame. These columns can be stored in vectors, matrices or data frames. Recall that we also used these functions for matrices. Suppose that you have a data frame df1 and vectors D and E. The data frame df1 has 4 rows and 3 variables. As you may recall from the section on matrices, this means the columns you want to add need at least 4 rows and the rows you want to add need at least 3 columns.

df1 <- data.frame(A = c(11, 21, 31, 41), B = c(12, 22, 32, 42), C = c(13, 23, 33, 43))
D <- c(14, 24, 34, 44)
E <- c(51, 52, 53, 54)

Let’s now use cbind() to add the vector D to df1:

cbind(df1, D)
   A  B  C  D
1 11 12 13 14
2 21 22 23 24
3 31 32 33 34
4 41 42 43 44

Here, R used the name of the vector as a variable name in the data frame df1. What if the vector is not named? To see what happens, let’s use

cbind(df1, c(10, 11, 12, 13))
   A  B  C c(10, 11, 12, 13)
1 11 12 13                10
2 21 22 23                11
3 31 32 33                12
4 41 42 43                13

As you can see, R uses the expression that created the vector as the variable name. In other words, if the vector or matrix isn’t named, you need to add names before using cbind() or set names afterwards. Recall that you can create a component in a list using list$component <- .... As data frames are lists, you can use the same approach to add a new variable to a dataset. For instance, to add the vector D to df1 you can also use

df1$D <- c(14, 24, 34, 44)
df1
   A  B  C  D
1 11 12 13 14
2 21 22 23 24
3 31 32 33 34
4 41 42 43 44
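
As a minimal sketch of both naming options, the example below names the new column inside the cbind() call and then renames it afterwards using names(). The data frame df_demo and the names new_col and renamed are hypothetical.

```r
# Hypothetical data frame used only for this illustration
df_demo <- data.frame(A = c(11, 21), B = c(12, 22))

# Option 1: name the unnamed vector inside cbind()
out <- cbind(df_demo, new_col = c(10, 11))
names(out)

# Option 2: set (or change) the name afterwards
names(out)[3] <- "renamed"
names(out)
```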

Adding rows uses rbind(). Adding rows to a data frame is only relevant when the rows you add include observations for the same variables. Suppose that the vector E includes observations for the variables A, B, C and D. Using rbind() you can add them to the data frame:

rbind(df1, E)
   A  B  C  D
1 11 12 13 14
2 21 22 23 24
3 31 32 33 34
4 41 42 43 44
5 51 52 53 54

To add a data frame df2

df2 <- data.frame(G = c(18, 28, 38, 48),  H = c(19, 29, 39, 49))

to df1, you can use the same functions. For instance, adding the columns of df2 to those of df1:

cbind(df1, df2)
   A  B  C  D  G  H
1 11 12 13 14 18 19
2 21 22 23 24 28 29
3 31 32 33 34 38 39
4 41 42 43 44 48 49

and adding the rows of df3

df3 <- data.frame(A = c(51, 61), B = c(52, 62), C = c(53, 63), D = c(54, 64))

to those of df1 using rbind():

rbind(df1, df3)
   A  B  C  D
1 11 12 13 14
2 21 22 23 24
3 31 32 33 34
4 41 42 43 44
5 51 52 53 54
6 61 62 63 64
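
When you add the rows of one data frame to another, rbind() matches the columns by name, not by position, and it returns an error if the names differ. The sketch below illustrates both cases with hypothetical data frames df_a, df_b and df_c. The tryCatch() call is only there to capture the error message so that the example runs to the end.

```r
# Hypothetical data frames used only for this illustration
df_a <- data.frame(A = 1:2, B = 3:4)
df_b <- data.frame(B = 7:8, A = 5:6)  # same names, different order
df_c <- data.frame(A = 9, Z = 10)     # Z does not exist in df_a

# Columns are matched by name: the values 5 and 6 end up in column A
stacked <- rbind(df_a, df_b)
stacked

# Different variable names cause an error; capture its message
msg <- tryCatch(rbind(df_a, df_c), error = function(e) conditionMessage(e))
msg
```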

4.5.3.3 New columns using other columns

Often you want to create a new variable where you use other values in your dataset. There are a couple of ways to do so. First you can create a new variable and add the calculation on the right hand side of the assignment operator. As an example, suppose that you want to add the log of A to df1. To do so, you can use

df1$logA <- log(df1$A)
df1
   A  B  C  D     logA
1 11 12 13 14 2.397895
2 21 22 23 24 3.044522
3 31 32 33 34 3.433987
4 41 42 43 44 3.713572

Using with(data, expression, ...) you can avoid the references to the data frame in the calculation. The first argument in the function is the data frame where R will look for the variables used in expression. In other words, with(df1, ...) allows you to eliminate df1$ in your calculation. If you use A in that expression, R knows that this A is a variable included in df1. To add a column to df1 calculated as the ratio df1$A/df1$B you would use:

df1$ratioAB <- with(df1, A/B)
df1
   A  B  C  D     logA   ratioAB
1 11 12 13 14 2.397895 0.9166667
2 21 22 23 24 3.044522 0.9545455
3 31 32 33 34 3.433987 0.9687500
4 41 42 43 44 3.713572 0.9761905

Without this function, you would have to write

df1$ratioABalt <- df1$A/df1$B
df1
   A  B  C  D     logA   ratioAB ratioABalt
1 11 12 13 14 2.397895 0.9166667  0.9166667
2 21 22 23 24 3.044522 0.9545455  0.9545455
3 31 32 33 34 3.433987 0.9687500  0.9687500
4 41 42 43 44 3.713572 0.9761905  0.9761905

Using with() you have to assign the result of a calculation to the data frame using df$newvar. Using the within() function, you can avoid this. This function has the same arguments as the with() function, but you add the name of the new variable in the expression part. The within() function returns a new data frame which is a copy of the old data frame plus the columns you added in the expression. In other words, the within() function preserves the “old” data frame and you have to assign the result of within() to a new data frame if you want to access these new values. If you are sure you won’t need the old data frame, you can assign the result of within() to that old data frame. Using within() also allows you to add multiple expressions. As an example, suppose you want to add the sum of A and B as well as the difference between D and C to the data frame (note the {} and the fact that every new variable has a new line without a comma at the end of the line):

dfnew1 <- within(df1, {
  sumAB <- A + B
  diffDC <- D - C
  })
dfnew1
   A  B  C  D     logA   ratioAB ratioABalt diffDC sumAB
1 11 12 13 14 2.397895 0.9166667  0.9166667      1    23
2 21 22 23 24 3.044522 0.9545455  0.9545455      1    43
3 31 32 33 34 3.433987 0.9687500  0.9687500      1    63
4 41 42 43 44 3.713572 0.9761905  0.9761905      1    83

If you assign the results to an existing variable, within() overwrites this variable:

dfnew2 <- within(df1, {
  A <- A / 10
  B <- B * 10
  C <- C / D
  })
dfnew2
    A   B         C  D     logA   ratioAB ratioABalt
1 1.1 120 0.9285714 14 2.397895 0.9166667  0.9166667
2 2.1 220 0.9583333 24 3.044522 0.9545455  0.9545455
3 3.1 320 0.9705882 34 3.433987 0.9687500  0.9687500
4 4.1 420 0.9772727 44 3.713572 0.9761905  0.9761905

Note that you need to be careful when you design the sequence of expressions. For instance, if you first change A, and then use the value of A in your expression for B, R will use the new values for A as it doesn’t recall what the values of A were before you changed them.
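
A minimal sketch of this effect, using a hypothetical data frame d: the two calls below include the same two expressions, but in a different order, and return different values for B.

```r
# Hypothetical data frame used only for this illustration
d <- data.frame(A = c(1, 2), B = c(10, 20))

# A is changed first: B is calculated with the new values of A
r1 <- within(d, {
  A <- A * 10
  B <- B + A
})
r1$B  # 20 40

# B is calculated first: it uses the original values of A
r2 <- within(d, {
  B <- B + A
  A <- A * 10
})
r2$B  # 11 22
```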

4.5.3.4 Deleting rows or columns

To delete rows and columns, you can use the familiar approaches. For instance, you can use

  • negative subsetting to remove column “C”
df4 <- df1[-3]
df4
   A  B  D     logA   ratioAB ratioABalt
1 11 12 14 2.397895 0.9166667  0.9166667
2 21 22 24 3.044522 0.9545455  0.9545455
3 31 32 34 3.433987 0.9687500  0.9687500
4 41 42 44 3.713572 0.9761905  0.9761905
  • assigning NULL to remove column “B”
df1$B <- NULL
df1
   A  C  D     logA   ratioAB ratioABalt
1 11 13 14 2.397895 0.9166667  0.9166667
2 21 23 24 3.044522 0.9545455  0.9545455
3 31 33 34 3.433987 0.9687500  0.9687500
4 41 43 44 3.713572 0.9761905  0.9761905
  • a logical condition to remove all observations for which the condition returns FALSE (e.g. all observations where the variable A is not equal to 31):
df1[df1$A == 31, ]
   A  C  D     logA ratioAB ratioABalt
3 31 33 34 3.433987 0.96875    0.96875

Using the within() function, you can use <- NULL to delete multiple columns from your data frame:

dfnew1 <- within(dfnew1, {
  A <- NULL
  ratioAB <- NULL
  ratioABalt <- NULL
  sumAB <- NULL
  })
dfnew1
   B  C  D     logA diffDC
1 12 13 14 2.397895      1
2 22 23 24 3.044522      1
3 32 33 34 3.433987      1
4 42 43 44 3.713572      1
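
As an additional sketch, you can also delete columns by name using a logical index, and delete rows using negative row indices. The data frame d and the vector drop_cols are hypothetical; the %in% operator checks for every column name whether it is included in drop_cols.

```r
# Hypothetical data frame used only for this illustration
d <- data.frame(A = 1:3, B = 4:6, C = 7:9)

# Delete columns by name: keep the columns whose name is not in drop_cols
drop_cols <- c("A", "C")
kept <- d[!(names(d) %in% drop_cols)]
names(kept)

# Delete rows 1 and 3 using negative row indices
fewer_rows <- d[-c(1, 3), ]
fewer_rows
```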

4.5.4 Data frames and functions

There is little difference between the way you apply functions to a data frame and the way you apply them to vectors, matrices or lists. This shouldn’t come as a surprise as a data frame is a list with characteristics of a matrix and R functions are vectorized. A couple of examples to illustrate some functions:

  • a summary of a data frame:
summary(df1)
       A              C              D             logA          ratioAB      
 Min.   :11.0   Min.   :13.0   Min.   :14.0   Min.   :2.398   Min.   :0.9167  
 1st Qu.:18.5   1st Qu.:20.5   1st Qu.:21.5   1st Qu.:2.883   1st Qu.:0.9451  
 Median :26.0   Median :28.0   Median :29.0   Median :3.239   Median :0.9616  
 Mean   :26.0   Mean   :28.0   Mean   :29.0   Mean   :3.147   Mean   :0.9540  
 3rd Qu.:33.5   3rd Qu.:35.5   3rd Qu.:36.5   3rd Qu.:3.504   3rd Qu.:0.9706  
 Max.   :41.0   Max.   :43.0   Max.   :44.0   Max.   :3.714   Max.   :0.9762  
   ratioABalt    
 Min.   :0.9167  
 1st Qu.:0.9451  
 Median :0.9616  
 Mean   :0.9540  
 3rd Qu.:0.9706  
 Max.   :0.9762  
  • means per column
colMeans(df1)
         A          C          D       logA    ratioAB ratioABalt 
26.0000000 28.0000000 29.0000000  3.1474942  0.9540381  0.9540381 
  • means per row:
rowMeans(df1)
[1]  7.038538 12.158936 17.228581 22.277659
  • total sum per column:
colSums(df1)
         A          C          D       logA    ratioAB ratioABalt 
104.000000 112.000000 116.000000  12.589977   3.816153   3.816153 
  • total sum per row:
rowSums(df1)
[1]  42.23123  72.95361 103.37149 133.66595
  • apply() function: mean per column:
apply(df1, 2, mean)
         A          C          D       logA    ratioAB ratioABalt 
26.0000000 28.0000000 29.0000000  3.1474942  0.9540381  0.9540381 
  • lapply() function: standard deviation per column:
lapply(df1, \(x) sd(x))
$A
[1] 12.90994

$C
[1] 12.90994

$D
[1] 12.90994

$logA
[1] 0.5700948

$ratioAB
[1] 0.02648301

$ratioABalt
[1] 0.02648301
  • sapply() function: maximum per column:
sapply(df1, \(x) max(x))
         A          C          D       logA    ratioAB ratioABalt 
41.0000000 43.0000000 44.0000000  3.7135721  0.9761905  0.9761905 
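
All columns of df1 are numeric, so the functions above work on the full data frame. If a data frame also includes non-numeric columns, functions such as colMeans() return an error. A minimal sketch of one way to deal with this, using a hypothetical data frame mixed: first build a logical index of the numeric columns with sapply() and is.numeric(), and then apply the function to these columns only.

```r
# Hypothetical data frame with a non-numeric column
mixed <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6), label = c("a", "b", "c"))

# Logical index: TRUE for numeric columns, FALSE otherwise
num_cols <- sapply(mixed, is.numeric)
num_cols

# Apply the function to the numeric columns only
col_means <- colMeans(mixed[num_cols])
col_means
```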

Create a 20x3 matrix mat with rownames obs_1 … and variable names var_1 … whose values are drawn from a uniform distribution with minimum 50 and maximum 100:

rn <- paste("obs", 1:20, sep = "_")
cn <- paste("var", 1:3, sep = "_")
mat <- matrix(runif(60, 50, 100), 20, 3, dimnames = list(rn, cn))

Create a data frame mat_df and a tibble mat_tb. Note that for the tibble, you need to include tibble::

mat_df <- as.data.frame(mat)
mat_tb <- tibble::as_tibble(mat)

Extract the column var_1 from both and store in col_df and col_tb using the $ operator:

col_df <- mat_df$var_1
col_tb <- mat_tb$var_1

Check the type of both columns you extracted:

typeof(col_df)
[1] "double"
typeof(col_tb)
[1] "double"

Let’s now use a real dataset, mtcars, which is part of your R installation. Assign this dataset to a data frame df:

df <- mtcars

Use df to create a tibble tb of mtcars:

tb <- tibble::as_tibble(df)

Print both datasets by running only their name

  • data frame
df
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
  • tibble:
tb
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows

What is the difference in result between a data frame and a tibble?

Tell R to keep the row names from df when it creates the tibble tb and store them in a variable models:

tb <- tibble::as_tibble(df, rownames = "models")
tb
# A tibble: 32 × 12
   models        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows

Extract the column hp from the df data frame and assign this column to a variable hp

hp <- df$hp

Extract the column disp from the tibble tb using the [] operator. Assign this column to a variable disp:

disp <- tb["disp"]

If you ask R to print this variable (do this in the console) what do you expect will happen: R prints all lines or R prints the first 10 lines?

Extract from df the observations for cars that include a digit at the end of their name (e.g. Duster 360, Mazda RX4):

pat <- "\\d+$"
df[grepl(pattern = pat, x = row.names(df)), ]
               mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Datsun 710    22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Duster 360    14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 230      22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280      19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Fiat 128      32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Camaro Z28    13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Fiat X1-9     27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2

Do the same, but now, use the tibble tb

pat <- "\\d+$"
tb[grepl(pattern = pat, x = tb$models), ]
# A tibble: 9 × 12
  models         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4     21       6 160     110  3.9   2.62  16.5     0     1     4     4
2 Datsun 710    22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
3 Duster 360    14.3     8 360     245  3.21  3.57  15.8     0     0     3     4
4 Merc 230      22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
5 Merc 280      19.2     6 168.    123  3.92  3.44  18.3     1     0     4     4
6 Fiat 128      32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
7 Camaro Z28    13.3     8 350     245  3.73  3.84  15.4     0     0     3     4
8 Fiat X1-9     27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
9 Porsche 914…  26       4 120.     91  4.43  2.14  16.7     0     1     5     2

Extract all observations from df whose am == 1:

df[df$am == 1, ]
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Add a new variable to the tibble, tb$mpg_cyl, calculated as the ratio of the variable mpg and cyl:

tb$mpg_cyl <- with(tb, mpg/cyl)
tb
# A tibble: 32 × 13
   models        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows
# ℹ 1 more variable: mpg_cyl <dbl>

Use the within() function to add 3 columns to df: mpg/cyl, mpg/hp and mpg/disp. Store these in mpg_cyl, mpg_hp and mpg_disp. Overwrite df and show the first 5 lines of this new data frame using head(x, n = 5):

df <- within(df, {
  mpg_cyl <- mpg/cyl
  mpg_hp <- mpg/hp
  mpg_disp <- mpg/disp
})
head(df, n = 5)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb   mpg_disp
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 0.13125000
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 0.13125000
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 0.21111111
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 0.08294574
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 0.05194444
                     mpg_hp  mpg_cyl
Mazda RX4         0.1909091 3.500000
Mazda RX4 Wag     0.1909091 3.500000
Datsun 710        0.2451613 5.700000
Hornet 4 Drive    0.1945455 3.566667
Hornet Sportabout 0.1068571 2.337500

Use apply() to calculate the mean per variable in df:

apply(df, 2, mean, na.rm = TRUE)
        mpg         cyl        disp          hp        drat          wt 
 20.0906250   6.1875000 230.7218750 146.6875000   3.5965625   3.2172500 
       qsec          vs          am        gear        carb    mpg_disp 
 17.8487500   0.4375000   0.4062500   3.6875000   2.8125000   0.1398688 
     mpg_hp     mpg_cyl 
  0.1905456   3.8369792 

Do the same, but now for the tibble tb:

cond <- sapply(tb, \(x) is.numeric(x))
apply(tb[cond], 2, mean)
       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb    mpg_cyl 
  0.437500   0.406250   3.687500   2.812500   3.836979 
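
The filter on numeric columns matters because apply() first converts its input to a matrix, and a matrix holds a single type. A minimal sketch, using a small made-up data frame (toy_df is an assumption for illustration and not part of the exercise data), shows the coercion and a column-wise alternative:

```r
# Hypothetical toy data: one character column, two numeric columns
toy_df <- data.frame(name = c("a", "b", "c"),
                     x = c(1, 2, 3),
                     y = c(10, 20, 30))

# apply() works on as.matrix(toy_df); the character column turns
# every value into a character string
toy_mat <- as.matrix(toy_df)
typeof(toy_mat)

# Keep only the numeric columns before computing the column means
is_num <- vapply(toy_df, is.numeric, logical(1))
toy_means <- colMeans(toy_df[is_num])
toy_means
```

The same idea applies to tb: the models column is a character column, so it has to be left out before a numeric summary per column makes sense.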

Ask for a summary table of df:

summary(df)
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb          mpg_disp      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000   Min.   :0.02203  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000   1st Qu.:0.04956  
 Median :0.0000   Median :4.000   Median :2.000   Median :0.09458  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812   Mean   :0.13987  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:0.17740  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000   Max.   :0.47679  
     mpg_hp           mpg_cyl     
 Min.   :0.04478   Min.   :1.300  
 1st Qu.:0.08944   1st Qu.:1.928  
 Median :0.15041   Median :3.108  
 Mean   :0.19055   Mean   :3.837  
 3rd Qu.:0.24129   3rd Qu.:5.700  
 Max.   :0.58462   Max.   :8.475  

Predict what summary(tb) would return. Then create the same summary table as for df, but now for the numeric columns of tb:

summary(tb[cond])
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb          mpg_cyl     
 Min.   :0.0000   Min.   :3.000   Min.   :1.000   Min.   :1.300  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1.928  
 Median :0.0000   Median :4.000   Median :2.000   Median :3.108  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812   Mean   :3.837  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:5.700  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000   Max.   :8.475  

4.6 Time series

4.6.1 Introduction

Time series are special because their observations are observed, measured or recorded at a specific moment in time (a date or a date/time). In economics and management, a lot of data come in the form of a time series: sales are measured per month, quarter or year, accounting data refer to a specific year, semester or quarter, stock prices are recorded by day, hour or minute, and inflation or unemployment are usually reported on a monthly basis. This property has a couple of consequences.

First, these observations are ordered. The date or time allows you to say which observation comes first, which comes second and which comes last. Extracting observations from a time series needs to preserve this property.

Second, most time frames can be aggregated. For instance, a week is an aggregation of days, a year is an aggregation of quarters, months, weeks or days, and an hour is an aggregation of minutes. In other words, you can start from a monthly time series and generate a yearly series. How you do so depends on the series. For instance, you can add 4 quarters of sales to calculate yearly sales. However, this does not work for e.g. stock market prices, where the sum of prices across time doesn’t make sense. Here, you need another measure, e.g. the price at the end of the last hour of trading as the price for the day, or the last price of the month for a monthly series of stock market prices.

Third, time can be regular or irregular. If time is regular, you measure something at evenly spaced moments in time: every month, every year or every minute. If a time series is irregular, this is not the case. For instance, if you measure the noise generated by departing airplanes in areas close to the airport, you’ll have a measurement each time an airplane takes off. Here, your time index will show irregular intervals.

In addition to pure time series, a lot of datasets include both a cross section (e.g. firms) and a time dimension (e.g. sales per year). This is called a panel dataset: for every firm, country, household, … in your dataset, you observe variables at multiple times, e.g. one observation for every year for the last 10 years. If you have a dataset that includes 10 years of yearly sales data for 50 products, your panel dataset includes 500 observations: for every product, you have 10 observations, one for each of the 10 years in your dataset.
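
To make the 50 x 10 structure concrete, here is a minimal sketch in base R. The product names, the years and the sales values are made up for illustration:

```r
# Hypothetical panel: 50 products observed once a year for 10 years
panel <- expand.grid(year = 2015:2024,
                     product = paste0("product_", 1:50),
                     stringsAsFactors = FALSE)

set.seed(123)                                 # reproducible toy values
panel$sales <- runif(nrow(panel), 100, 200)   # made-up sales figures

nrow(panel)                                   # 50 x 10 = 500 rows
obs_per_product <- table(panel$product)       # 10 observations per product
head(panel, 3)
```

Every row is one product-year combination, which is the usual "long" layout for panel data.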

To handle time series, base R includes the ts class, created with the ts() function. This class uses regular time intervals. In addition, there are many packages that extend the ability of R to work with time series, e.g. {zoo} or {xts}. These packages also allow irregular time intervals. The time series equivalent of a tibble is called a tsibble and is provided by the {tsibble} package (Wang, Cook, and Hyndman (2020)). This package allows you to change or mutate time series data. Packages such as {quantmod}, {tidyfinance}, {forecast} or {econometrics} build on these formats to e.g. develop quantitative trading strategies ({quantmod}), analyse financial data ({tidyfinance}), develop forecasts ({forecast}) or estimate regressions including methods for time series ({econometrics}).

4.6.2 The basics

In this section we will use base R’s ts() as well as the {xts} (eXtensible Time Series) package. Installing the latter automatically installs {zoo}. To install (if needed) and load {xts}, you run

if (!require("xts")) install.packages("xts")
Loading required package: xts
Loading required package: zoo

Attaching package: 'zoo'
The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

4.6.2.1 Creating a time series: ts()

To create a time series, you need to include both the data as well as the date/time values. With respect to the first, let’s create a vector with 25 values drawn as a sequence starting at 10 in steps of 10:

data <- seq(10, by = 10, length.out = 25)

Note that data could also be a matrix or a data frame. We now want to create a time series. To do so, we need to add the date/time dimension. Using base R’s ts() function, you can add a start, an end and a frequency. The start is included as a single value or as a vector. For instance, start = 2001 starts the series in 2001, while start = c(2001, 1) starts the series in the first period of 2001 (e.g. 2001-01 for monthly data). The frequency is the number of observations per unit of time (here, per year): 1 refers to yearly data, 4 to quarterly data and 12 to monthly data. Specifying the start and frequency allows R to determine the end date from the length of the series. Let’s create a yearly time series for the values in data starting in 2000. To do so, we use:

ts_data_year <- ts(data, start = 2000, frequency = 1)

If you print the series,

ts_data_year
Time Series:
Start = 2000 
End = 2024 
Frequency = 1 
 [1]  10  20  30  40  50  60  70  80  90 100 110 120 130 140 150 160 170 180 190
[20] 200 210 220 230 240 250

you see that R created a time series with start in 2000, end in 2024 with frequency equal to 1, i.e. yearly.
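
The same information can be queried from the object itself. The sketch below rebuilds the yearly series under a different name (ts_year, an illustrative name) and uses the base R helpers start(), end(), frequency(), time() and window():

```r
values <- seq(10, by = 10, length.out = 25)
ts_year <- ts(values, start = 2000, frequency = 1)

start(ts_year)       # first time point: year 2000, period 1
end(ts_year)         # last time point: year 2024, period 1
frequency(ts_year)   # number of observations per year
time(ts_year)        # the time value of every observation

# window() subsets by time rather than by position
ts_recent <- window(ts_year, start = 2020, end = 2022)
ts_recent
```

window() is the time-aware counterpart of positional subsetting: the result is still a ts object with its own start and end.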

To create quarterly data, you can use

ts_data_quar <- ts(data, start = c(2015, 1), frequency = 4)
ts_data_quar
     Qtr1 Qtr2 Qtr3 Qtr4
2015   10   20   30   40
2016   50   60   70   80
2017   90  100  110  120
2018  130  140  150  160
2019  170  180  190  200
2020  210  220  230  240
2021  250               

R adds the reference to quarters and determines the final quarter from the length of the data. To create a monthly series starting in June, you change the frequency to 12 and change the start month:

ts_data_mont <- ts(data, start = c(2023, 6), frequency = 12)
ts_data_mont
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2023                      10  20  30  40  50  60  70
2024  80  90 100 110 120 130 140 150 160 170 180 190
2025 200 210 220 230 240 250                        

You can verify that these series are time series using class(). For instance, to check if ts_data_mont is a time series, you use

class(ts_data_mont)
[1] "ts"
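
The introduction noted that how you aggregate depends on the series. As a hedged sketch, base R's aggregate() can turn a quarterly ts into a yearly one; the two series below (sales and price) are made-up values, and taking the last quarter as the yearly price is just one possible convention:

```r
# Hypothetical quarterly series covering two full years
sales <- ts(c(10, 20, 30, 40, 50, 60, 70, 80),
            start = c(2023, 1), frequency = 4)
price <- ts(c(100, 102, 101, 105, 107, 110, 108, 112),
            start = c(2023, 1), frequency = 4)

# Flow variable: add the four quarters of each year
sales_year <- aggregate(sales, nfrequency = 1, FUN = sum)

# Price: keep the last quarterly observation of each year
price_year <- aggregate(price, nfrequency = 1,
                        FUN = function(x) x[length(x)])

sales_year   # 100 and 260
price_year   # 105 and 112
```

The nfrequency argument sets the new number of observations per year, and FUN receives the block of quarterly values that belong to one year.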

You can extend this example to e.g. matrices. In that case, data is a matrix and every column is treated as a separate time series.
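
A minimal sketch of the matrix case, with two made-up quarterly variables (the column names sales and price are assumptions for illustration):

```r
# Hypothetical data: two quarterly variables stored as matrix columns
m <- cbind(sales = seq(10, by = 10, length.out = 8),
           price = seq(100, by = -5, length.out = 8))
ts_mat <- ts(m, start = c(2023, 1), frequency = 4)

class(ts_mat)                    # includes "mts" (multiple time series) and "ts"
ts_sales <- ts_mat[, "sales"]    # a single column is again a time series
ts_sales
```

All columns share the same start and frequency, which is why ts() needs them only once.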

4.6.2.2 The {xts} package

First, let’s load the package:

library(xts)

As you saw when {xts} was first loaded, this package also loads another package, {zoo}. This is because {xts} relies on some of the functions in the {zoo} package.

Let’s now create a date/time variable using seq.POSIXt() with length 25 (consistent with the length of data) and in intervals of one month:

datetime <- seq.POSIXt(from = as.POSIXct("2022-03-25"), length.out = 25, by = "months")

To create an {xts} object, you can use as.xts(). The first argument of this function is the dataset, in this case data. The second argument is the date/time variable:

data_xts <- as.xts(data, datetime)

Inspecting this time series shows that the dates/times are added to data as an index, printed like row names, in the usual ISO format: %Y-%m-%d.

data_xts
           [,1]
2022-03-25   10
2022-04-25   20
2022-05-25   30
2022-06-25   40
2022-07-25   50
2022-08-25   60
2022-09-25   70
2022-10-25   80
2022-11-25   90
2022-12-25  100
2023-01-25  110
2023-02-25  120
2023-03-25  130
2023-04-25  140
2023-05-25  150
2023-06-25  160
2023-07-25  170
2023-08-25  180
2023-09-25  190
2023-10-25  200
2023-11-25  210
2023-12-25  220
2024-01-25  230
2024-02-25  240
2024-03-25  250

The class of the time series data_xts is “xts”, “zoo”. The latter is included because the former builds on the latter.

class(data_xts)
[1] "xts" "zoo"

Note that the data in an {xts} object are essentially a matrix. In other words, an {xts} object cannot store more than one variable type. For most applications, this is usually not much of an issue. However, if your data include a mix of types, you’ll need to store the numeric variables in a separate dataset.
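
Unlike ts(), an {xts} object does not need evenly spaced times. The following sketch uses the xts() constructor with made-up take-off times and noise levels (both are assumptions for illustration); the times are deliberately given out of order to show that the index is sorted:

```r
library(xts)

# Hypothetical take-off times (irregular, and deliberately not sorted)
takeoff <- as.POSIXct(c("2025-03-25 08:31:10",
                        "2025-03-25 08:02:00",
                        "2025-03-25 08:09:30"), tz = "UTC")
noise_db <- c(86.7, 88.1, 91.4)      # made-up noise measurements

noise <- xts(noise_db, order.by = takeoff)
noise                                 # rows are sorted by time

# The time between consecutive measurements is not constant
gaps <- diff(index(noise))
gaps
```

Setting tz = "UTC" is a design choice here: it keeps the printed times independent of the time zone of the machine that runs the code.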

4.6.2.3 Coercing other structures into a time series

You can coerce other data structures into a time series object. To illustrate this, let’s first create two other objects: a 50x4 matrix and a data frame. Let’s first create a matrix with values and add a vector with 50 monthly dates as an extra column.

mat1 <- matrix(runif(200, min = 50, max = 100), 50, 4)

colnames(mat1) <- paste("var", 1:4, sep = "_")
rownames(mat1) <- paste("obs", 1:50, sep = "_")

mat_dates <- seq.POSIXt(from = as.POSIXct("2020-03-25"), 
                        length.out = 50, 
                        by = "months")

mat <- cbind(mat_dates, mat1)

Recall that a matrix is a homogeneous structure. In other words, the dates will be converted into numeric format.

Using this matrix, we can create a data frame. Here, we can add various types of data.

mat1_df <- as.data.frame(mat1, row.names = rownames(mat1))
mat_df <- cbind(mat_dates, mat1_df)

Note that in this case, the column with dates is shown as a date/time variable. Let’s now use the {xts} package to coerce both into a time series format. The as.xts() function has multiple arguments: as.xts(x, order.by, dateFormat = "POSIXct", ...). The first, x, is the matrix or data frame. The second, order.by =, should include a variable that allows R to order the values in x. The dateFormat argument allows you to change the format from the default POSIXct to e.g. Date. Let’s use this function to change the matrix into a time series:

mat_ts <- as.xts(mat, order.by = as.POSIXct(mat[, 1], origin = "1970-01-01"))
head(mat_ts, 5)
            mat_dates    var_1    var_2    var_3    var_4
2020-03-25 1585090800 61.81532 71.34831 84.25773 87.37033
2020-04-25 1587765600 83.67822 91.79865 79.53507 90.04232
2020-05-25 1590357600 73.80289 58.45403 62.74126 71.43401
2020-06-25 1593036000 62.37750 72.38072 71.67449 92.25952
2020-07-25 1595628000 64.98585 65.08229 91.09376 85.55991

The function returns a time series, where it used the dates in mat_dates in the first column of mat to add date/time values to the matrix. In doing so, it kept mat_dates as a separate numeric variable in the data set.

The data frame includes the date/time variable as a POSIXct type. In other words, the dates are already stored as date/time values. As a result, you don’t need to coerce that variable into a date in the as.xts() function. It is sufficient to include it in the order.by = argument:

mat_dfts <- as.xts(mat_df, order.by = mat_df$mat_dates)
head(mat_dfts, 5)
            mat_dates    var_1    var_2    var_3    var_4
2020-03-25 2020-03-25 61.81532 71.34831 84.25773 87.37033
2020-04-25 2020-04-25 83.67822 91.79865 79.53507 90.04232
2020-05-25 2020-05-25 73.80289 58.45403 62.74126 71.43401
2020-06-25 2020-06-25 62.37750 72.38072 71.67449 92.25952
2020-07-25 2020-07-25 64.98585 65.08229 91.09376 85.55991

Note that here too, R kept the mat_dates variable in the time series dataset. However, here you are including various data types in an xts object. Recall that these objects are essentially matrices, so R will coerce all variables to one common type (typically character). To avoid that, you need to exclude the mat_dates variable from the coercion:

mat_dfts <- as.xts(mat_df[, 2:5], order.by = mat_df$mat_dates)
head(mat_dfts, 5)
              var_1    var_2    var_3    var_4
2020-03-25 61.81532 71.34831 84.25773 87.37033
2020-04-25 83.67822 91.79865 79.53507 90.04232
2020-05-25 73.80289 58.45403 62.74126 71.43401
2020-06-25 62.37750 72.38072 71.67449 92.25952
2020-07-25 64.98585 65.08229 91.09376 85.55991
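
The effect of keeping the date column can be checked on the underlying data. This is a small sketch with a made-up three-row data frame (toy_df, with_dates and only_num are illustrative names):

```r
library(xts)

# Hypothetical data frame with a date/time column and one numeric column
dates <- as.POSIXct(c("2020-03-25", "2020-04-25", "2020-05-25"), tz = "UTC")
toy_df <- data.frame(when = dates, var_1 = c(1.5, 2.5, 3.5))

with_dates <- as.xts(toy_df, order.by = toy_df$when)
only_num   <- as.xts(toy_df[, "var_1", drop = FALSE], order.by = toy_df$when)

typeof(coredata(with_dates))   # no longer numeric once the dates are included
typeof(coredata(only_num))     # numeric data stay numeric

# Arithmetic is only meaningful on the numeric version
doubled <- only_num * 2
doubled
```

The drop = FALSE argument keeps the single selected column as a data frame, so the column name var_1 survives the coercion.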

4.6.3 Subsetting

Let’s now use mat_dfts to extract specific variables. Most subsetting approaches that we covered for other data structures can be used for xts time series as well. Note that if you use the preserving subsetting operator [], the result will always show the relevant date/time, as R preserves the structure of the dataset. For example:

  • extracting columns 2 to 3:
head(mat_dfts[, 2:3], n = 5)
              var_2    var_3
2020-03-25 71.34831 84.25773
2020-04-25 91.79865 79.53507
2020-05-25 58.45403 62.74126
2020-06-25 72.38072 71.67449
2020-07-25 65.08229 91.09376
  • extracting all columns but the first:
head(mat_dfts[, -1], n = 5)
              var_2    var_3    var_4
2020-03-25 71.34831 84.25773 87.37033
2020-04-25 91.79865 79.53507 90.04232
2020-05-25 58.45403 62.74126 71.43401
2020-06-25 72.38072 71.67449 92.25952
2020-07-25 65.08229 91.09376 85.55991
  • extracting the values for the 4th row:
mat_dfts[4, ]
             var_1    var_2    var_3    var_4
2020-06-25 62.3775 72.38072 71.67449 92.25952

Using the $ operator, you can extract variables, e.g.

head(mat_dfts$var_1, n = 10)
              var_1
2020-03-25 61.81532
2020-04-25 83.67822
2020-05-25 73.80289
2020-06-25 62.37750
2020-07-25 64.98585
2020-08-25 52.63511
2020-09-25 66.57814
2020-10-25 66.31016
2020-11-25 53.81508
2020-12-25 50.25629

In addition, and specifically for time series, you can use the date/times to extract specific components. For instance:

  • a specific date:
mat_dfts["2020-07-25"]
              var_1    var_2    var_3    var_4
2020-07-25 64.98585 65.08229 91.09376 85.55991
  • a range of dates using [“start/end”]:
mat_dfts["2020-03-25/2020-07-25"]
              var_1    var_2    var_3    var_4
2020-03-25 61.81532 71.34831 84.25773 87.37033
2020-04-25 83.67822 91.79865 79.53507 90.04232
2020-05-25 73.80289 58.45403 62.74126 71.43401
2020-06-25 62.37750 72.38072 71.67449 92.25952
2020-07-25 64.98585 65.08229 91.09376 85.55991
  • from the beginning of the series up to a given date [“/end”]:
mat_dfts["/2020-07-25"]
              var_1    var_2    var_3    var_4
2020-03-25 61.81532 71.34831 84.25773 87.37033
2020-04-25 83.67822 91.79865 79.53507 90.04232
2020-05-25 73.80289 58.45403 62.74126 71.43401
2020-06-25 62.37750 72.38072 71.67449 92.25952
2020-07-25 64.98585 65.08229 91.09376 85.55991
  • from a given date to the end of the series [“start/”]:
mat_dfts["2023-12-25/"]
              var_1    var_2    var_3    var_4
2023-12-25 76.81282 78.46668 88.45526 82.58745
2024-01-25 91.99313 69.49864 60.66829 86.80344
2024-02-25 80.60366 63.13936 74.42565 97.02449
2024-03-25 81.18256 90.97987 52.96503 61.63764
2024-04-25 76.47097 99.46359 91.42821 96.14635
  • an entire year [“2022”]:
mat_dfts["2022"]
              var_1    var_2    var_3    var_4
2022-01-25 83.15106 77.58532 88.85743 91.75516
2022-02-25 50.57743 93.83828 59.57863 56.19557
2022-03-25 92.09622 54.00683 66.53835 66.70615
2022-04-25 62.74961 58.89334 56.34485 68.11273
2022-05-25 88.70798 87.75809 65.69200 57.85879
2022-06-25 72.01018 79.88379 61.53902 95.35869
2022-07-25 94.04675 66.27377 75.93474 86.22792
2022-08-25 52.57497 91.33814 85.01536 73.57971
2022-09-25 51.21142 71.47336 53.46573 77.88881
2022-10-25 70.93977 65.64055 70.35639 55.68449
2022-11-25 80.27070 94.45147 75.94056 55.40697
2022-12-25 58.05967 74.95672 81.39050 50.77193

If you have daily data, for instance, you can extract a single month using [“2022-03”]. This extracts all values for March 2022.
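
A short sketch of this on daily data. The series daily is made up for illustration (it simply numbers the days of 2022), and a Date index is used to keep time zones out of the picture:

```r
library(xts)

# Hypothetical daily series for 2022: the value is the day number in the year
days <- seq(as.Date("2022-01-01"), as.Date("2022-12-31"), by = "day")
daily <- xts(seq_along(days), order.by = days)

march  <- daily["2022-03"]            # all days in March 2022
spring <- daily["2022-03/2022-05"]    # March up to and including May 2022

nrow(march)    # 31
nrow(spring)   # 92
```

The partial date strings select whole periods: the range ends on the last day of May, not on the first.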

Using first() and last(), you can extract the first x weeks of the dataset by including “x weeks” in the function first() and the last y months by including “y months” in the function last(). Note that you can refer to weeks even if the periodicity of the dataset is monthly. R will extract all observations within this x week period. Valid periods are seconds, minutes, hours, days, weeks, months, quarters and years. For instance:

  • extract the data for the first 2 quarters in the dataset (here the first quarter includes only 1 month):
first(mat_dfts, "2 quarters")
              var_1    var_2    var_3    var_4
2020-03-25 61.81532 71.34831 84.25773 87.37033
2020-04-25 83.67822 91.79865 79.53507 90.04232
2020-05-25 73.80289 58.45403 62.74126 71.43401
2020-06-25 62.37750 72.38072 71.67449 92.25952
  • extract the last 2 quarters (note that here the last quarter includes only one month):
last(mat_dfts, "2 quarters")
              var_1    var_2    var_3    var_4
2024-01-25 91.99313 69.49864 60.66829 86.80344
2024-02-25 80.60366 63.13936 74.42565 97.02449
2024-03-25 81.18256 90.97987 52.96503 61.63764
2024-04-25 76.47097 99.46359 91.42821 96.14635

Combining first() and last():

  • extract the first 3 months of the last 4 quarters:
first(last(mat_dfts, "4 quarters"), "3 months")
              var_1    var_2    var_3    var_4
2023-07-25 60.92104 76.31153 93.02249 97.34942
2023-08-25 86.19620 78.71048 65.48694 94.27129
2023-09-25 80.78298 81.12048 95.89146 92.30111

Recall that mat_dfts is a monthly time series. You can determine the endpoints for another time interval, e.g. quarter or year. R then returns the position of the last observation in each quarter or year. In addition to years and quarters, you can also determine the endpoints for e.g. months, weeks, days, hours and minutes. Using these endpoints, you can extract the data for these moments.

Let’s first determine the endpoints per year (i.e. the last observations for a year):

end_year <- endpoints(mat_dfts, on = "year")
end_year
[1]  0 10 22 34 46 50

These observations are included on the 10th row, the 22nd row, … . Using this vector to subset the time series allows you to extract the values of all variables in mat_dfts at these end points:

mat_dfts[end_year]
              var_1    var_2    var_3    var_4
2020-12-25 50.25629 71.80983 54.86632 51.58079
2021-12-25 73.69827 60.67609 59.01273 80.17038
2022-12-25 58.05967 74.95672 81.39050 50.77193
2023-12-25 76.81282 78.46668 88.45526 82.58745
2024-04-25 76.47097 99.46359 91.42821 96.14635
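
The logic behind these positions is easier to follow on a short series. The sketch below uses a made-up monthly series of 16 observations (x and month_dates are illustrative names) that starts in March 2020, so the first year holds 10 months and the second year 6:

```r
library(xts)

# Hypothetical monthly series: March 2020 to June 2021, values 1 to 16
month_dates <- seq(as.Date("2020-03-25"), by = "month", length.out = 16)
x <- xts(1:16, order.by = month_dates)

ep_year <- endpoints(x, on = "years")      # 0 10 16
ep_quar <- endpoints(x, on = "quarters")   # 0 1 4 7 10 13 16

x[ep_year]   # last observation of 2020 and of 2021
```

The leading 0 marks the start of the first interval; it selects no row when the vector is used for subsetting.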

There are two special functions that allow you to extract the core data and the index. The first refers to all variables, other than the date/time index. To extract these variables, you use the coredata() function:

core <- coredata(mat_dfts)
head(core, n = 5)
        var_1    var_2    var_3    var_4
[1,] 61.81532 71.34831 84.25773 87.37033
[2,] 83.67822 91.79865 79.53507 90.04232
[3,] 73.80289 58.45403 62.74126 71.43401
[4,] 62.37750 72.38072 71.67449 92.25952
[5,] 64.98585 65.08229 91.09376 85.55991

The index refers to the date/time index. Using the index() function allows you to extract these values:

datetime <- index(mat_dfts)
head(datetime, n = 5)
[1] "2020-03-25 CET"  "2020-04-25 CEST" "2020-05-25 CEST" "2020-06-25 CEST"
[5] "2020-07-25 CEST"
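
Together, coredata() and index() allow you to move from an xts object back to a data frame, for instance when the dates are needed as an ordinary column. A minimal sketch with a made-up three-row series (x, df_back and x_again are illustrative names):

```r
library(xts)

# Hypothetical xts object with two integer columns
x <- xts(matrix(1:6, ncol = 2, dimnames = list(NULL, c("a", "b"))),
         order.by = as.Date("2024-01-01") + 0:2)

# Dates become a regular column next to the data
df_back <- data.frame(date = index(x), coredata(x))
df_back

# And back again: the date column is only used to order the rows
x_again <- as.xts(df_back[, c("a", "b")], order.by = df_back$date)
x_again
```

Note that the date column is left out of the data part in the second step, for the reason given in the section on coercion.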

4.6.4 Time series functions

4.6.4.1 Data on the time series

Counting the number of months, quarters or years in a time series dataset can be done using nmonths(), nquarters() or nyears(). For instance, mat_dfts includes:

nmonths(mat_dfts)
[1] 50

50 months,

nquarters(mat_dfts)
[1] 18

18 quarters and

nyears(mat_dfts)
[1] 5

5 years of data.

Note that the first and last of these five years don’t include data for all 12 months of the year.

You can determine the periodicity (e.g. monthly, yearly, hourly) using periodicity(). The function estimates the frequency of the time series observations:

periodicity(mat_dfts)
Monthly periodicity from 2020-03-25 to 2024-04-25 

4.6.4.2 Lags and leads

In addition to the functions we have introduced for other data structures, there are a couple of functions specific to time series. The first function is lag(x, k). This function computes the lagged version of a time series. For instance, with k = 1, the lag of a monthly series pairs every date with the value of one month earlier: the lagged value for March 2025 is the observation for February 2025. This allows you to compute the difference between two observations across time. The default value of k is 1. Changing this to e.g. 12 for a monthly series returns the value of the same variable 12 months ago. Because there is no earlier value for the first k observations, R sets these to NA. For instance, to determine the monthly change in all variables included in mat_dfts:

mat_lag1 <- mat_dfts - lag(mat_dfts, k = 1)
head(mat_lag1, 5)
                var_1      var_2      var_3      var_4
2020-03-25         NA         NA         NA         NA
2020-04-25  21.862903  20.450336  -4.722668   2.671986
2020-05-25  -9.875329 -33.344624 -16.793810 -18.608307
2020-06-25 -11.425395  13.926690   8.933233  20.825511
2020-07-25   2.608355  -7.298423  19.419271  -6.699614

If you change k = 1 into k = 12, you calculate the change relative to the same month in the previous year. This is often referred to as the Year over Year (YoY) change:

mat_lag12 <- mat_dfts - lag(mat_dfts, k = 12)
last(mat_lag12, 5)
                var_1      var_2       var_3      var_4
2023-12-25  18.753147   3.509959   7.0647632  31.815517
2024-01-25   7.487500 -28.240115  -4.3133763  36.497257
2024-02-25 -18.817803 -31.976606   0.2734934  16.111089
2024-03-25   7.618603   9.233648 -21.6129541 -34.376052
2024-04-25   6.163255   9.653780  29.0678414  -1.707843
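
One point to keep in mind is that the direction of lag() differs between classes. The sketch below compares an xts series with a base R ts series, both made up for illustration; stats::lag() is called explicitly to make clear which function is used:

```r
library(xts)

x_xts <- xts(1:5, order.by = as.Date("2025-01-01") + 0:4)
x_ts  <- ts(1:5, start = 2001)

# xts: a positive k gives the previous value on each date
lag_xts <- lag(x_xts, k = 1)
lag_xts                            # NA 1 2 3 4

# base ts: a positive k moves the time axis back, so the series starts earlier
lag_ts <- stats::lag(x_ts, k = 1)
start(lag_ts)                      # starts in 2000 instead of 2001
```

For ts objects the values are unchanged and only the time stamps move, whereas xts keeps the dates and moves the values.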

Using diff(x, lag = 1, differences = 1) allows you to calculate similar differences. The lag = 1 argument specifies the lag and is similar to the k = 1 argument in the lag() function. The differences = 1 argument allows you to specify the order of the differencing. The first order (the default) calculates the difference between the levels. The second order calculates the difference in the differences (comparable to a second derivative). To illustrate:

mat_dif1 <- diff(mat_dfts, lag = 1, differences = 1)
head(mat_dif1, 5)
                var_1      var_2      var_3      var_4
2020-03-25         NA         NA         NA         NA
2020-04-25  21.862903  20.450336  -4.722668   2.671986
2020-05-25  -9.875329 -33.344624 -16.793810 -18.608307
2020-06-25 -11.425395  13.926690   8.933233  20.825511
2020-07-25   2.608355  -7.298423  19.419271  -6.699614

calculates the same change as x - lag(x, k = 1). However,

mat_dif2 <- diff(mat_dfts, lag = 1, differences = 2)
head(mat_dif2, n = 5)
                var_1     var_2     var_3     var_4
2020-03-25         NA        NA        NA        NA
2020-04-25         NA        NA        NA        NA
2020-05-25 -31.738232 -53.79496 -12.07114 -21.28029
2020-06-25  -1.550066  47.27131  25.72704  39.43382
2020-07-25  14.033750 -21.22511  10.48604 -27.52512

calculates the change in the difference: the difference of the first order differences, i.e. the second order difference.
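
With small round numbers, the relation between the two orders is easy to verify by hand. A minimal sketch on a made-up series of five daily values:

```r
library(xts)

x <- xts(c(10, 12, 15, 19, 24), order.by = as.Date("2025-01-01") + 0:4)

d1 <- diff(x)                     # first differences:  NA 2 3 4 5
d2 <- diff(x, differences = 2)    # second differences: NA NA 1 1 1
d1_twice <- diff(diff(x))         # differencing the first differences

all.equal(coredata(d2), coredata(d1_twice))
```

The series grows by one extra unit every day, so the first differences increase by 1 and the second differences are constant.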

4.6.4.3 period.apply

Recall the apply() function for matrices. The period.apply() function has a similar use for time series. The function requires an xts object, an index and a function. The index needs to define non-overlapping intervals. The endpoints() function is one way to specify these intervals. You can also specify your own vector, as long as it starts with 0, ends with the number of rows in the xts object and defines non-overlapping intervals. The period.apply() function will then apply a function to all observations within an interval. For instance, recall that endpoints() returns a vector with index breakpoints:

end_year <- endpoints(mat_dfts, on = "years")
end_year
[1]  0 10 22 34 46 50

The first interval runs from the 1st to the 10th observation, the second yearly interval from the 11th to the 22nd observation, … . You can now use period.apply() to calculate e.g. the mean for every year:

period.apply(mat_dfts, INDEX = end_year, FUN = colMeans)
              var_1    var_2    var_3    var_4
2020-12-25 63.62546 71.06288 77.21998 81.80398
2021-12-25 75.99454 74.77712 72.00702 77.25454
2022-12-25 71.36631 76.34164 70.05446 69.62891
2023-12-25 78.67123 82.97227 78.22946 75.37588
2024-04-25 82.56258 80.77036 69.87180 85.40298

As you can see, this code returns the mean value per year for all 4 variables. If it makes more sense to calculate the sum, you use colSums instead:

period.apply(mat_dfts, end_year, colSums)
              var_1    var_2    var_3    var_4
2020-12-25 636.2546 710.6288 772.1998 818.0398
2021-12-25 911.9344 897.3254 864.0843 927.0545
2022-12-25 856.3958 916.0997 840.6536 835.5469
2023-12-25 944.0548 995.6672 938.7536 904.5106
2024-04-25 330.2503 323.0815 279.4872 341.6119

More generally, for every non-overlapping period in INDEX, the function period.apply() will apply the function in FUN. The index is a vector with positions that show the end point of every interval. For instance, c(0, 3, 6, 9) would introduce intervals covering the first 3 observations, observations 4, 5 and 6, and observations 7, 8 and 9. For each of these groups of three observations, R would then apply the function in FUN. If this function is colMeans, R calculates, for every variable in the dataset, the mean of the three observations in each of the time intervals; colSums calculates, for every variable in the dataset, the sum of the three observations in each of the time intervals.
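
The c(0, 3, 6, 9) case can be sketched on a made-up series of 9 daily observations with two variables (a runs from 1 to 9, b from 10 to 18; all names are illustrative):

```r
library(xts)

# Hypothetical data: 9 daily observations of two variables
x <- xts(matrix(1:18, ncol = 2, dimnames = list(NULL, c("a", "b"))),
         order.by = as.Date("2025-01-01") + 0:8)

idx <- c(0, 3, 6, 9)    # three intervals of three observations each

m3 <- period.apply(x, INDEX = idx, FUN = colMeans)
s3 <- period.apply(x, INDEX = idx, FUN = colSums)

m3   # a: 2, 5, 8    b: 11, 14, 17
s3   # a: 6, 15, 24  b: 33, 42, 51
```

Each row of the result is stamped with the date of the last observation in its interval, just as in the yearly example.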

Make sure that the {xts} package is loaded.

Create a 104x2 matrix data with column names high and low and values runif(104, 100, 200) and runif(104, 10, 20):

data <- matrix(c(runif(104, 100, 200), runif(104, 10, 20)), 104, 2)
colnames(data) <- c("high", "low")

Create a weekly time sequence of 104 weeks starting on 2023-01-01, assign it to weeks and add this variable to the data matrix:

weeks <- seq.POSIXt(from = as.POSIXct("2023-01-01", format = "%Y-%m-%d", tz = "UTC"), length.out = 104, by = "weeks")
data <- cbind(weeks, data)

Add both in an xts object datats and remove the weeks column:

datats <- as.xts(data, order.by = as.POSIXct(data[, 1], origin = "1970-01-01"))
datats <- datats[, -1]

Determine the periodicity of datats as well as the number of months and years:

periodicity(datats)
Weekly periodicity from 2023-01-01 01:00:00 to 2024-12-22 01:00:00 
nmonths(datats)
[1] 24
nyears(datats)
[1] 2

Determine the quarterly end points:

end_quar <- endpoints(datats, on = "quarter")

Use the period.apply() function to calculate the sum per quarter of the variables in datats. Store the results in datatsq:

datatsq <- period.apply(datats, end_quar, colSums)
datatsq
                        high      low
2023-03-26 01:00:00 1873.203 185.6338
2023-06-25 02:00:00 2058.894 211.4065
2023-09-24 02:00:00 1871.825 197.9169
2023-12-31 01:00:00 2219.496 180.3415
2024-03-31 01:00:00 1924.175 191.3454
2024-06-30 02:00:00 2034.071 178.1488
2024-09-29 02:00:00 1901.548 199.3277
2024-12-22 01:00:00 1931.220 163.2491

Calculate the weekly difference (the change from one observation to the next) for the variables in datats and store the results in diff_datats:

diff_datats <- diff(datats, lag = 1, differences = 1)

Use lag() to calculate the percentage change in high and store as pct_high:

pct_high <- (datats$high - lag(datats$high))/lag(datats$high)

4.7 Data tables

A data.table is an enhanced data.frame. This data structure allows you to e.g. query the data inside the table using an SQL-like syntax. To use this data structure, you need to install and load the {data.table} package. To do so, you first install the package (if you haven’t done so yet)

install.packages("data.table")

and load the package

library(data.table)
Warning: package 'data.table' was built under R version 4.4.3

Attaching package: 'data.table'
The following objects are masked from 'package:xts':

    first, last
The following objects are masked from 'package:zoo':

    yearmon, yearqtr

Here, we will not cover data.tables in depth, but give a couple of examples of how a data.table differs from the traditional data.frame. These examples will show why a data.table is usually faster than a data.frame, especially on large datasets. If you need to work with very large datasets, you can use e.g. Barrett et al. (2025) as an introduction to this data structure.

Let’s first create a data.table. You’ll see that the basic syntax is comparable to the usual data.frame() syntax:

dt <- data.table(
  firm = LETTERS[1:25],
  sales = runif(25, 100, 200), 
  margin = rnorm(25, 10, 2), 
  sector = sample(c("services", "services", "industry", "construction", "transport"), 25, replace = TRUE))
head(dt, 10)
      firm    sales    margin    sector
    <char>    <num>     <num>    <char>
 1:      A 196.4297 11.993541 transport
 2:      B 136.2853  9.378965 transport
 3:      C 111.2189 13.642239 transport
 4:      D 135.9506  9.769425  services
 5:      E 177.5073 11.911084  services
 6:      F 106.0136  8.349874  services
 7:      G 121.1611 13.540675  services
 8:      H 195.2492 11.021530  services
 9:      I 196.5682 10.317356  services
10:      J 191.2377  9.823625  services

This function returns a data.table. As a data.table is an enhanced version of a data.frame, it is also a data.frame.

class(dt)
[1] "data.table" "data.frame"

In other words, you can apply most data.frame functions and subsetting rules to a data.table:

head(dt[, 1], 5)
     firm
   <char>
1:      A
2:      B
3:      C
4:      D
5:      E
dt$sales
 [1] 196.4297 136.2853 111.2189 135.9506 177.5073 106.0136 121.1611 195.2492
 [9] 196.5682 191.2377 190.2272 156.6190 145.9440 113.6654 174.5056 198.7093
[17] 188.1009 167.9208 156.7068 173.0615 151.7091 144.0244 106.2058 113.5574
[25] 174.3352
dt[["margin"]]
 [1] 11.993541  9.378965 13.642239  9.769425 11.911084  8.349874 13.540675
 [8] 11.021530 10.317356  9.823625  8.171818 10.430808 12.485047 13.928489
[15]  9.368508  6.896699  9.374375  8.561754 11.633696 11.025556 11.621973
[22]  9.256063  8.135490 10.406770  7.344467
head(dt[sales > 150, 1:3], 5)
     firm    sales    margin
   <char>    <num>     <num>
1:      A 196.4297 11.993541
2:      E 177.5073 11.911084
3:      H 195.2492 11.021530
4:      I 196.5682 10.317356
5:      J 191.2377  9.823625

These 4 subsetting operations return either a data.table (preserving operator []) or a vector (simplifying operators [[]] and $). Note that you can use

The attributes of the data.table include the names (columns) and the row names (rows are numbered in this case and the numbers are stored as $row.names). The attributes further include the class as well as .internal.selfref, a pointer that {data.table} uses internally to keep track of the object in memory.

attributes(dt)
$names
[1] "firm"   "sales"  "margin" "sector"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

$class
[1] "data.table" "data.frame"

$.internal.selfref
<pointer: 0x0000018f5a429cd0>
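
The fact that a data.table is also a data.frame can be checked on a small example. The sketch below uses a made-up data.frame (df and dt_small are illustrative names, not objects from this chapter) and converts it with as.data.table(). It also shows one subsetting difference that is easy to overlook: a single index selects a row in a data.table, but a column in a data.frame.

```r
library(data.table)

# A small illustrative data.frame (values are made up for this sketch)
df <- data.frame(firm = c("A", "B", "C"), sales = c(120, 150, 180))

# as.data.table() returns a data.table copy of the data.frame
dt_small <- as.data.table(df)

# A data.table is still a data.frame and also carries its own class
is.data.frame(dt_small)
inherits(dt_small, "data.table")

# Functions written for data.frames accept a data.table
nrow(dt_small)
names(dt_small)

# A single index selects a row in a data.table ...
dt_small[2]
# ... but a column in a data.frame
df[2]
```

If you rely on code written for data.frames, it is worth checking single-index subsetting in particular, as this is where the two structures behave differently.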

However, in addition to the subsetting rules for data.frames, data.tables allow you to work with the general form dt[i, j, by], where i selects the rows to subset or reorder, j is a calculation on the columns and by defines the groups for which that calculation is done. Let’s start with i and extract only firms that are in “industry” or in “transport”:

dt_ind <- dt[sector == "industry" | sector == "transport"]
dt_ind
     firm    sales    margin    sector
   <char>    <num>     <num>    <char>
1:      A 196.4297 11.993541 transport
2:      B 136.2853  9.378965 transport
3:      C 111.2189 13.642239 transport
4:      K 190.2272  8.171818 transport
5:      M 145.9440 12.485047 transport
6:      Q 188.1009  9.374375  industry
7:      T 173.0615 11.025556  industry
8:      W 106.2058  8.135490  industry
9:      Y 174.3352  7.344467 transport

As you would expect, R subsets the data.table and extracts only those observations for which the boolean expression sector == "industry" | sector == "transport" is TRUE. All other observations, for which the expression returns FALSE, are skipped. Here, you use the logical subsetting rules that you know from the data.frame section, except that you can refer to the column sector directly instead of writing dt$sector.
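
When the condition tests one column against several values, %in% is a common alternative to chaining == with |. The following is a minimal sketch on made-up data (dt_toy is a hypothetical table, not the dt object used above). It also illustrates that the i position can reorder rows, here by decreasing sales.

```r
library(data.table)

# Toy data, made up for this sketch (not the dt object from the text)
dt_toy <- data.table(
  firm   = c("A", "B", "C", "D", "E"),
  sales  = c(110, 150, 130, 170, 190),
  sector = c("transport", "services", "industry", "services", "construction")
)

# Two equivalent ways to keep the industry and transport firms
sel_or <- dt_toy[sector == "industry" | sector == "transport"]
sel_in <- dt_toy[sector %in% c("industry", "transport")]

# Compare the two results
all.equal(sel_or, sel_in)

# i can also reorder rows: sort by decreasing sales
dt_toy[order(-sales)]
```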

Let’s now add a calculation within the subsetting and ask R to calculate the sum, mean, minimum and maximum of sales for these two sectors. To do so, we use the j position in dt[]: the i position still selects the sectors and the j position now holds the calculation. As we have more than one calculation (sum, mean, min and max), we wrap them in .(), i.e. parentheses preceded by a dot, which is the {data.table} shorthand for list():

dt_ind_sum <- dt[sector == "industry" | sector == "transport", .(sum(sales), mean(sales), min(sales), max(sales))]
dt_ind_sum
         V1       V2       V3       V4
      <num>    <num>    <num>    <num>
1: 1421.809 157.9787 106.2058 196.4297

We now have a data.table with the sum, mean, minimum and maximum of sales for these two sectors. As we did not name the results, R labels the columns V1 to V4. Note that in this case, subsetting for data.tables and data.frames is different: within a data.frame, you cannot add calculations inside the subsetting operator.
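
The default column names V1 to V4 are not very informative. Because .() is an alias for list(), naming its elements names the columns of the result. The sketch below assumes a small made-up table (dt_toy) and hypothetical column names (total, average, lowest, highest).

```r
library(data.table)

# Toy data, made up for this sketch
dt_toy <- data.table(
  firm   = c("A", "B", "C", "D"),
  sales  = c(100, 200, 300, 400),
  sector = c("transport", "industry", "services", "transport")
)

# Naming the elements inside .() names the columns of the result
res <- dt_toy[sector %in% c("industry", "transport"),
              .(total   = sum(sales),
                average = mean(sales),
                lowest  = min(sales),
                highest = max(sales))]
res
```

Named results are easier to work with later on, for instance when you want to refer to res$total instead of res$V1.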

To calculate these statistics for each sector separately, we can now use the by position. If we include by = "sector", R will calculate the sum, mean, min and max for each sector.

dt_ind_sum <- dt[sector == "industry" | sector == "transport", .(sum(sales), mean(sales), min(sales), max(sales)), by = "sector"]
dt_ind_sum
      sector       V1       V2       V3       V4
      <char>    <num>    <num>    <num>    <num>
1: transport 954.4403 159.0734 111.2189 196.4297
2:  industry 467.3682 155.7894 106.2058 188.1009

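Here, R reads the dt[] subsetting as: using only the firms in industry or transport, calculate the sum, mean, min and max of sales for each distinct value in sector. Here, the only distinct values in sector are “industry” and “transport”. The condition in i does not have to involve the grouping variable. For instance, you can keep all firms with a margin above 7.499 and calculate the same statistics for each sector: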

dt_ind_sum <- dt[margin > 7.499, .(sum(sales), mean(sales), min(sales), max(sales)), by = "sector"]
dt_ind_sum
         sector        V1       V2       V3       V4
         <char>     <num>    <num>    <num>    <num>
1:    transport  780.1051 156.0210 111.2189 196.4297
2:     services 2032.0241 156.3095 106.0136 196.5682
3: construction  270.3722 135.1861 113.6654 156.7068
4:     industry  467.3682 155.7894 106.2058 188.1009
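
It is often useful to know how many observations are behind each group summary. A minimal sketch, again on made-up data (dt_toy and the column names n_firms, total and average are assumptions for illustration): the special symbol .N in the j position returns the number of rows in each group after the filter in i has been applied.

```r
library(data.table)

# Toy data, made up for this sketch
dt_toy <- data.table(
  firm   = c("A", "B", "C", "D", "E", "F"),
  sales  = c(100, 200, 300, 400, 500, 600),
  margin = c(12, 6, 9, 11, 8, 10),
  sector = c("transport", "industry", "services",
             "transport", "industry", "services")
)

# Filter in i, summarise in j, group in by; .N counts the rows per group
by_sector <- dt_toy[margin > 7.499,
                    .(n_firms = .N,
                      total   = sum(sales),
                      average = mean(sales)),
                    by = "sector"]
by_sector
```

With by, the groups appear in the order in which they are first encountered in the filtered data.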

A data.table further allows you to create a new variable for all observations in the data.table. To do so, you use the j position and create the new variable as a function of the other variables: you write the name of the new variable, followed by := and the expression R needs to evaluate. For instance, you can generate a variable gross_profit as the product of sales and margin (where you divide margin by 100 to convert the percentage into a fraction):

dt[, gross_profit := sales * (margin / 100)]
head(dt, n = 5)
     firm    sales    margin    sector gross_profit
   <char>    <num>     <num>    <char>        <num>
1:      A 196.4297 11.993541 transport     23.55887
2:      B 136.2853  9.378965 transport     12.78215
3:      C 111.2189 13.642239 transport     15.17274
4:      D 135.9506  9.769425  services     13.28159
5:      E 177.5073 11.911084  services     21.14305
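
Note that := changes the data.table in place: there is no assignment with <- in the line above. The sketch below, on made-up data (dt_toy, dt_alias and dt_copy are hypothetical names), illustrates what this means in practice. A second name created with <- refers to the same table and sees the new column, whereas copy() creates an independent table.

```r
library(data.table)

# Toy data, made up for this sketch
dt_toy <- data.table(firm = c("A", "B"), sales = c(100, 200), margin = c(10, 5))

# Plain assignment gives a second name for the same table;
# copy() creates an independent copy
dt_alias <- dt_toy
dt_copy  <- copy(dt_toy)

# := adds the column in place, without reassigning dt_toy
dt_toy[, gross_profit := sales * (margin / 100)]

"gross_profit" %in% names(dt_alias)  # the alias is checked for the new column
"gross_profit" %in% names(dt_copy)   # so is the independent copy

# Several columns at once, using the functional form of :=
dt_toy[, `:=`(cost = sales - gross_profit, margin_share = margin / 100)]

# Assigning NULL removes a column again
dt_toy[, margin_share := NULL]
names(dt_toy)
```

If you want to keep the original table unchanged, work on a copy() before using :=.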

Because operations such as creating a new variable or calculating the sum, mean, min and max are done within the subsetting, a data.table is faster than a data.frame, especially on large datasets. With a data.frame, most of the results shown here would require R to call additional functions, often from other packages. Doing so slows down the process.
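
To get a feeling for the difference, you can compare a grouped mean computed with the base R function aggregate() on a data.frame with the same calculation inside dt[]. This is only a rough sketch on simulated data: the object names (df_big, dt_big), the number of rows and the seed are arbitrary assumptions, and the timings depend on your machine.

```r
library(data.table)

set.seed(1)  # arbitrary seed so the sketch is reproducible
n <- 1e5     # arbitrary number of rows
df_big <- data.frame(
  sales  = runif(n, 100, 200),
  sector = sample(c("services", "industry", "construction", "transport"),
                  n, replace = TRUE)
)
dt_big <- as.data.table(df_big)

# Base R: a separate function call that builds a new data.frame
res_df <- aggregate(sales ~ sector, data = df_big, FUN = mean)

# data.table: the grouping happens inside [ ];
# keyby sorts the result by sector
res_dt <- dt_big[, .(sales = mean(sales)), keyby = "sector"]

# Compare the group means of both approaches
all.equal(res_df$sales, res_dt$sales)

# Rough timings; the numbers depend on the machine and the data
system.time(aggregate(sales ~ sector, data = df_big, FUN = mean))
system.time(dt_big[, .(sales = mean(sales)), keyby = "sector"])
```

Increasing n makes the comparison more informative, as differences on small tables are usually negligible.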